I have two DataFrames in Apache Spark.
The first has the show number and descriptions; the data looks like:
show_no | descrip
a | this is mikey
b | here comes donald
c | mary and george go home
d | mary and george come to town
The second DataFrame has the characters:
characters
george
donald
mary
minnie
I need to search the show descriptions to find out which shows feature which characters.
The final output should look like:
character | showscharacterisin
george | c,d
donald | b
mary | c,d
minnie | No show
These datasets are contrived and simple, but they express the search functionality I am trying to implement. I basically need to search the text of one DataFrame using the values from another DataFrame.
This would be easy to do in a UDF inside SQL Server: I would loop through the show descriptions and return the show number using a "contains" search on each description.
The problem I have is that I see no way to do this using a DataFrame.
1) I think you should further break down the first dataset so that show_no is mapped to each word in the description.
For example, the first row could be broken down like:
show_no | descrip
a | this
a | is
a | mikey
2) You can filter out the stopwords from this if needed.
3) After this you can join it with "characters" to get the final desired output.
Hope this helps.
Amit
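The three steps above can be sketched outside Spark as well. The following is a plain-Python illustration of the same explode-and-join logic, using the example data from the question (in PySpark you would use split and explode on descrip, a left join from the characters DataFrame to the exploded words, and collect_list to gather the show numbers):

```python
# Plain-Python sketch of the explode-and-join approach from the answer.
# In PySpark the equivalent steps would be F.split/F.explode on descrip,
# a left join from characters to the exploded words, and collect_list.

shows = {
    "a": "this is mikey",
    "b": "here comes donald",
    "c": "mary and george go home",
    "d": "mary and george come to town",
}
characters = ["george", "donald", "mary", "minnie"]

# Step 1: break each description down into (show_no, word) pairs.
exploded = [(show_no, word)
            for show_no, descrip in shows.items()
            for word in descrip.split()]

# Step 3: "join" each character against the exploded pairs; a character
# with no matching word gets "No show", like a left join with a default.
result = {}
for ch in characters:
    matches = sorted(show_no for show_no, word in exploded if word == ch)
    result[ch] = ",".join(matches) if matches else "No show"

print(result)
# result["george"] == "c,d", result["minnie"] == "No show"
```

Step 2 (stop-word filtering) is omitted here since the join against the character list already discards non-matching words.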
It was hard to think of a title for this question, so hopefully it makes sense.
I will explain further. I have a flow of data from an Excel file, and each row has one of two words in the last column: it will contain either "Open" or "Current".
So let's say I have an input that looks like this:
NAME | SSN | TYPE
John | 12345| Current
Katy | 99999| Current
Sam | 33333| Current
John | 12345| Open
Cody | 55555| Open
The goal is to grab each person only once. Each person's unique ID is their SSN. I want to grab the Open row if both Open and Current exist for that person; if only Current exists, then grab that.
So the final output should look like this:
NAME | SSN | TYPE
Katy | 99999| Current
Sam | 33333| Current
John | 12345| Open
Cody | 55555| Open
NOTE: As you can see, the first entry for John has been removed since he had an Open row.
I have attempted this already, but it is sloppy and I figure there must be a better way. Here is an image of what I have done:
Talend flow
Here's how you can do it:
First sort the data by Name, and by Type descending (this is important so that for each person, the Open record is on top); then in the tMap, filter it like this:
Numeric.sequence(row2.name, 1, 1) == 1
This only lets the record through if it is the first time we're seeing this name.
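The same sort-then-keep-first idea can be sketched in plain Python (column names taken from the example; the keep-first check mirrors the Numeric.sequence(...) == 1 filter in the tMap):

```python
# Sketch of the sort-then-keep-first dedup: sort by TYPE descending so
# "Open" sorts before "Current" within each name, then keep only the
# first row seen for each SSN.

rows = [
    {"NAME": "John", "SSN": "12345", "TYPE": "Current"},
    {"NAME": "Katy", "SSN": "99999", "TYPE": "Current"},
    {"NAME": "Sam",  "SSN": "33333", "TYPE": "Current"},
    {"NAME": "John", "SSN": "12345", "TYPE": "Open"},
    {"NAME": "Cody", "SSN": "55555", "TYPE": "Open"},
]

# Two stable sorts: secondary key first (TYPE descending), then NAME.
rows.sort(key=lambda r: r["TYPE"], reverse=True)  # Open before Current
rows.sort(key=lambda r: r["NAME"])

seen = set()
deduped = []
for r in rows:
    if r["SSN"] not in seen:        # first time we see this person
        seen.add(r["SSN"])
        deduped.append(r)

for r in deduped:
    print(r["NAME"], r["SSN"], r["TYPE"])
```

Because "Open" sorts after "Current" alphabetically, the descending sort is what guarantees the Open row wins for John.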
I have a group of users who each have a variable that assigns them to a group. I can't share the data, but hopefully this example data will prove to be sufficient.
+-----+-----------+--------------+
| ID | Age Group | Location |
+-----+-----------+--------------+
| 1 | 18-34 | East Spain |
| 2 | 35-44 | North China |
| 3 | 35-44 | East China |
| 4 | 65+ | East Congo |
| 5 | 45-54 | North Japan |
| 6 | 0-17 | North Spain |
| 7 | 65+ | North Congo |
| 8 | 45-54 | East Japan |
| 9 | 0-17 | North Spain |
| 10 | 18-34 | East China |
| 11 | 18-34 | North China |
+-----+-----------+--------------+
My end goal is to create a sheet/dashboard, with a pie chart for age grouping. I want to filter this pie chart based on the Area, however, I want there to be two selections, one for Area (East/North), and one for Country (Spain/China/Congo/Japan). The filters will both be "Single Value Lists", so only one Area and one Country will be able to be selected at a time, but together they will combine to filter the patients. For example, if 'East' was chosen for the Area selection, and 'China' for the Country selection, the pie chart would only show for patients 3 and 10.
This helps reduce the number of selections that a user will have from 8, to 6. I know this isn't much of a difference, but in the actual data there are a lot more permutations and so the reduction would really help when de-cluttering the sheet/dashboard.
I've created the parameters for both Area and Country, but I don't know how to combine the two parameters to affect which patients are selected.
Let me know if I can clarify anything. If parameters aren't the way to do this, I am also open to other suggestions!
Thanks so much!
Why not split the location into two columns and then create filters for each column? Then you have exactly the functionality you want, just using filters, without parameters or calculations.
You could then drag Country onto Area in the data pane to tell Tableau there is a hierarchical relationship between the fields, and set the filter for Country to show "only relevant values", and the filter for Area to show "all values in the database" -- via the little black caret menu at the top right of the filter control.
Then the filter control for Country would only display values for the selected Area.
The other advantage this has is that you wouldn't need to maintain a separate list of parameter values. The set of values would be discovered automatically from your data. If areas or countries appear, get renamed or removed from your database, then you'll see that automatically in the filter choices. With parameters, if Korea unifies or the US splits into red USA and blue USA, you'll see that automatically and not risk preventing access to new data simply because your list of parameter values is out of date.
Create a calculated field that concatenates the values from your parameters and tests it against your location field. Then put that calculated field in your filters card and set it to True.
Calculated field should look like this:
([Area] + ' ' + [Country]) = [Location]
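To see why this works, here is the same test expressed in plain Python, using rows from the example table. The calculated field is just a string-equality check between the concatenated parameter values and the Location field:

```python
# Selecting Area='East' and Country='China' keeps only rows whose
# Location equals the concatenation "East China", i.e. patients 3 and 10.

patients = [
    (1,  "18-34", "East Spain"),
    (2,  "35-44", "North China"),
    (3,  "35-44", "East China"),
    (10, "18-34", "East China"),
]

area, country = "East", "China"
kept = [pid for pid, age_group, location in patients
        if location == area + " " + country]

print(kept)  # [3, 10]
```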
I have a table that appears in this format:
email | interest | major | employed | inserttime
jake@example.com | soccer | | true | 12:00
jake@example.com | | CS | true | 12:01
Essentially, this is a survey application, and users sometimes hit the back button to add new fields. I later changed the INSERT logic to UPSERT so it just updates the row where email = currentUsersEmail; however, for the data inserted prior to this code change there are many duplicate entries for single users. I have tried some GROUP BYs with no luck, as it continually says:
ID column must appear in the GROUP BY clause or be used in an
aggregate function.
Certainly there will be edge cases with clashing data; for example, the user may have entered true for the employed column the first time and false the second. For now I am not going to take this into account.
I simply want to merge or flatten these values into a single row, in this case it would look like:
email | interest | major | employed | inserttime
jake@example.com | soccer | CS | true | 12:01
I am guessing I would take the most recent inserttime. I have been writing the web application in Scala/Play, but for this task I think using a language like Python might be easier if I cannot do it directly through psql.
You can GROUP BY and flatten using MAX():
SELECT email, MAX(interest) AS interest,
       MAX(major) AS major, MAX(employed) AS employed,
       MAX(inserttime) AS inserttime
FROM your_table
GROUP BY email
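If you would rather do this in Python, as the question suggests, the same MAX-per-column flattening can be sketched like this (note that SQL's MAX ignores NULLs, while here empty strings play that role, since any non-empty string compares greater than ""):

```python
# Flatten duplicate rows per email by taking the max value of each
# column, mirroring the GROUP BY email / MAX() query above.

rows = [
    {"email": "jake@example.com", "interest": "soccer", "major": "",
     "employed": "true", "inserttime": "12:00"},
    {"email": "jake@example.com", "interest": "", "major": "CS",
     "employed": "true", "inserttime": "12:01"},
]

merged = {}
for r in rows:
    m = merged.setdefault(r["email"], dict(r))
    for col in ("interest", "major", "employed", "inserttime"):
        # Non-empty strings beat "", and "12:01" beats "12:00".
        m[col] = max(m[col], r[col])

print(merged["jake@example.com"])
```

This has the same caveat as the SQL version: each column's max is taken independently, so with genuinely clashing values the result may mix columns from different rows.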
I have a details table of posts and subjects dug from a forum. Each row is a single post-subject pair (i.e. postID and subjectID together form the primary key for the table); then I have some measures at subject level and some at post level. For example:
+---------+-------------+--------------+------------+--------------+--------+
| post.ID | post.Author | post.Replies | subject.ID | subject.Rank | year |
+---------+-------------+--------------+------------+--------------+--------+
| 1 | mike | 10 | movie | 4 | 1990 |
| 1 | mike | 10 | comics | 6 | 1990 |
| 2 | sarah | 0 | tv | 10 | 2001 |
| 3 | tom | 4 | tv | 10 | 2003 |
| 3 | tom | 4 | comics | 6 | 2003 |
| 4 | mike | 1 | movie | 4 | 2008 |
+---------+-------------+--------------+------------+--------------+--------+
I want to study the trend of posts and subjects by year and color it by subject.Rank.
The first two are easily measured by putting COUNTD(post.ID) and COUNTD(subject.ID) in Rows and 'year' in Columns.
But if I drag MEDIAN(subject.Rank) onto Color, I get a wrong result: it is calculated at row level, not at the distinct subject.ID level.
I think I can accomplish this using table calculation features, but I have no idea how to proceed.
It sounds like you are trying to treat Subject.Rank as a dimension, instead of as a measure. If so, just convert it to a dimension on the worksheet in question by right clicking on the field and choosing dimension. You can also convert it to a dimension in the data pane by dragging the field from the measures section up to the dimensions section. That will tell Tableau to treat that field as a dimension by default in the future.
A field can be treated as a dimension in some cases and as a measure in others, depending on what you are trying to achieve. If you are familiar with SQL, dimensions are used to partition data rows for aggregation, like the GROUP BY clause.
Finally, count distinct (COUNTD) can be expensive on large datasets. Often, you can get the same result another way. So try to think of other approaches and save COUNTD for when you really need it.
Try using an LOD expression such as {FIXED [1st level], [2nd level] : MEDIAN([subject.Rank])}
or
Table calculation approach:
When you put in the median, there is an "Edit Table Calculation" option; under "Compute Using" choose "Advanced" and put your fields in there (make sure they are ordered the way you want the calculation to run when you select them), then click OK and set the "At the level" and "Restarting every" options.
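The row-level versus distinct-subject distinction can be illustrated in plain Python. The ranks below are contrived (not the table from the question, where every subject happens to repeat the same number of times) so the two medians actually differ:

```python
from statistics import median

# Contrived (subject.ID, subject.Rank) detail rows: "tv" appears three
# times, which skews a median taken over raw rows.
rows = [("movie", 4), ("comics", 6), ("tv", 10), ("tv", 10), ("tv", 10)]

# Row-level median, like dropping MEDIAN(subject.Rank) straight on Color.
row_level = median(rank for _, rank in rows)

# FIXED-style median: one rank per distinct subject.ID, then the median.
per_subject = {sid: rank for sid, rank in rows}
subject_level = median(per_subject.values())

print(row_level, subject_level)  # 10 6
```

The duplicated tv rows drag the row-level median to 10, while the per-subject median over {4, 6, 10} is 6, which is what the LOD expression computes.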
I have a field from the data I am reading in that can contain multiple values. They are essentially tags.
For example, there could be a column called "persons responsible". This could read "Joe; Bob; Sue" or "Sue" for a given row.
Is it possible from within Tableau to read these in as separate categories? So that for this sample data:
Project | Persons
---------------------------
Zeta | Bob; Sue; Joe
Enne | Sue
Doble Ve | Bob
There could be a count of Bob (2), Sue (2), Joe (1)?
I am working on getting better data inputs, but I was wondering if there was a temporary solution at this level.
I would definitely work towards normalizing your schema.
In the meantime, there is a workaround that is almost reasonable if there is a small set of possible values for the tags (persons in your example).
If Bob, Sue and Joe are the only people in the system, you can use the CONTAINS() function to define a boolean calculated field for each person -- e.g. Bob_Is_Responsible = CONTAINS([Persons], "Bob"), and similar fields for Sue and Joe. Then you could use those as building blocks, possibly with sets, to break the data up in different ways.
Of course, this approach gets cumbersome fast if the number of tags grows, or if it is unconstrained. But you asked for a temporary solution ...
If the number of elements is small, you can write and union several queries, each one selecting the project and the nth element.
Ideally, you'd reshape your data to look like this, either in the database or with the above-mentioned union technique. Then you could COUNT() or COUNTD() the elements by project:
Project | Persons
---------------------------
Zeta | Bob
Zeta | Sue
Zeta | Joe
Enne | Sue
Doble Ve | Bob
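Outside Tableau, the reshape-and-count is a simple split on the tag delimiter. A Python sketch of the union/reshape result above, using the sample data from the question:

```python
from collections import Counter

# Split the "Persons" tag field on ";" to reshape into one row per
# (project, person), then count how many projects each person is on.

data = {
    "Zeta": "Bob; Sue; Joe",
    "Enne": "Sue",
    "Doble Ve": "Bob",
}

reshaped = [(project, person.strip())
            for project, persons in data.items()
            for person in persons.split(";")]

counts = Counter(person for _, person in reshaped)
print(counts)  # Bob: 2, Sue: 2, Joe: 1
```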