What methodology to use with Text and bounding Boxes? - classification

My data consists of every word in a PDF document (for instance, a bail), with its text, bounding box (x_min, x_max, y_min, y_max), document_id, page number, block number, and line number.
I manually annotated some words with labels, for instance the country name and the client name. Every other word gets the class "other", which heavily outnumbers the labels I added manually.
For example, I have 5 documents whose labels are distributed as follows:
City: 5 labels
Country: 5 labels
Name: 6 labels
Company: 6 labels
Other: 180 labels
What type of algorithm should I use on a new file to find the labels that matter? Also, do you have an idea of how I can estimate the amount of data needed for a robust model?
I tried to use the following models:
XGBoost
LightGBM
Random forest
But they are not robust.

Calculate areas of new features in merged layer in QGIS

I have merged four different layers into one new layer in QGIS, but I want this layer to carry different information than the old layers. I want each buffered 'island' to have its own ID and a calculated area. However, the attribute table currently shows just four features, one for each layer that I merged. Is there a way to update the attribute table so that it consists of new features (one for each 'island')?
This is what the layer looks like:
And this is what the attribute table now looks like:
And this is what I want (the 5th and 6th column especially):
You must create one feature for each single-part geometry. You can do this with the Multipart to singleparts tool, and then use the field calculator to get the area; see "Calculating polygon areas in shapefile using QGIS" for how to calculate the area.
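To make the idea concrete, here is a minimal sketch in plain Python. It is not the QGIS API. It assumes each merged feature is a list of parts, that each part is a single exterior ring of (x, y) vertices in a projected (planar) coordinate system, and that there are no holes. The names `ring_area`, `explode_to_singleparts`, and the sample layer names are made up for illustration.

```python
# Illustrative only: plain Python, not the QGIS API.
# A multipart feature is modelled as a list of parts; each part is one
# exterior ring given as (x, y) vertices in a projected (planar) CRS.

def ring_area(ring):
    """Planar area of a simple polygon ring via the shoelace formula."""
    total = 0.0
    # Pair every vertex with the next one, wrapping around at the end.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def explode_to_singleparts(features):
    """Return one record per part, each with a new ID and its own area."""
    singles = []
    for feature in features:
        for part in feature["parts"]:
            singles.append({
                "id": len(singles) + 1,            # new sequential ID
                "source_layer": feature["layer"],  # where the part came from
                "area": ring_area(part),
            })
    return singles


# Hypothetical merged layer: two features holding three 'islands' in total.
merged = [
    {"layer": "buffer_a", "parts": [
        [(0, 0), (2, 0), (2, 2), (0, 2)],          # 2 x 2 square
        [(10, 10), (13, 10), (13, 11), (10, 11)],  # 3 x 1 rectangle
    ]},
    {"layer": "buffer_b", "parts": [
        [(20, 0), (24, 0), (20, 3)],               # right triangle, legs 4 and 3
    ]},
]

islands = explode_to_singleparts(merged)
for row in islands:
    print(row)
```

Each multipart feature is split into its parts first, and the area is then computed per part. This mirrors the order of the two steps in QGIS.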

In Tableau Map plot multiple sites at same lat long

I have data where multiple shops are located at the same latitude and longitude.
For example:
Latitude  Longitude  ShopId  Type
6.24458   50.001756  101     Saloon
6.24458   50.001756  102     Grocery
6.24458   50.001756  103     Pharmacy
6.24458   50.001756  104     FishMarket
When I plot these on a map using the latitude and longitude above, I get a single mark. When I hover over the mark, I see the details of a single shop, but I want 4 marks, each showing its own ShopId and Type.
I am new to Tableau and cannot figure out how to do this.
You are likely getting 4 marks displayed at the same location, so when you click on the mark you see, you are only selecting the top mark. You can verify this by dragging over the mark to select all the marks within a selection rectangle. If you then right-click and view the data, you should see all 4 marks.
Another thing that can help when you have overlapping marks is to make the marks partially transparent and add a border around them. Both options are available by clicking the Color button on the Marks card to get to the advanced color settings.
If this is not the behavior you want, you have a couple of options. One easy approach is to add a little random noise to each latitude and longitude (called jitter). Adding a little jitter makes the marks visible, although the appropriate amount of jitter depends on your data and scale. Jittering is especially useful if all your points are geocoded to the same location, say if every building with a Los Angeles address is treated as if it were located at city hall. In that case, the geocoding already distorts the data to a degree that makes jittering just fine.
The undocumented RANDOM() function is an easy way to add some jitter. Excel and Hyper Extracts support RANDOM() among other data source types. It returns a number between 0 and 1.
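The jitter idea can be sketched outside Tableau as follows. This is a minimal Python illustration, not a Tableau calculation. The scale `JITTER_DEGREES` and the helper `jitter` are assumptions made for the example. A generator that returns values between 0 and 1, as RANDOM() is described above, is shifted by 0.5 so the noise is centered on the original coordinate.

```python
import random


def jitter(value, scale, rng):
    """Shift value by uniform noise in [-scale/2, scale/2)."""
    # rng.random() is in [0, 1); subtracting 0.5 centers the noise on zero.
    return value + (rng.random() - 0.5) * scale


rng = random.Random(42)   # fixed seed so the sketch is repeatable
JITTER_DEGREES = 0.001    # assumed scale; tune it to your data and zoom level

# The four shops from the question, all at the same coordinates.
shops = [
    (6.24458, 50.001756, 101),
    (6.24458, 50.001756, 102),
    (6.24458, 50.001756, 103),
    (6.24458, 50.001756, 104),
]

jittered = [
    (jitter(lat, JITTER_DEGREES, rng), jitter(lon, JITTER_DEGREES, rng), shop_id)
    for lat, lon, shop_id in shops
]

for lat, lon, shop_id in jittered:
    print(shop_id, lat, lon)
```

Each shop now has its own slightly displaced position, so it would be drawn as a separate mark.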
The other options involve treating your coordinates as continuous dimensions instead of measures, and then using some other visual attribute (size, color, etc.) to indicate the number of items at each location. It is often useful to combine nearby items with some sort of grid or hex bin function. In this case, instead of adding random noise to each coordinate, you round or truncate it in some way to effectively snap points to a grid. The ROUND(), HEXBINX(), and HEXBINY() functions are useful here. When using this approach, be sure your packed coordinate fields are continuous dimensions and have the appropriate Latitude or Longitude geographic role.
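As a rough illustration of snapping points to a grid and counting the items per cell, here is a small Python sketch. It is not Tableau code. The cell size, the helper `grid_cell`, and the two extra sample points are assumptions added for the example.

```python
from collections import Counter

CELL_DEGREES = 0.01  # assumed grid size; pick one that suits your map scale


def grid_cell(lat, lon, cell_size):
    """Return integer (row, column) indexes of the grid cell nearest the point."""
    # Integer indexes avoid floating-point noise when used as grouping keys.
    return (round(lat / cell_size), round(lon / cell_size))


points = [
    (6.24458, 50.001756),  # the four shops from the question
    (6.24458, 50.001756),
    (6.24458, 50.001756),
    (6.24458, 50.001756),
    (6.2449, 50.0021),     # hypothetical nearby point, same cell
    (6.3000, 50.1000),     # hypothetical distant point, different cell
]

counts = Counter(grid_cell(lat, lon, CELL_DEGREES) for lat, lon in points)

for (row, col), n in counts.items():
    # Multiply back by the cell size to get the snapped coordinate for plotting.
    print(row * CELL_DEGREES, col * CELL_DEGREES, n)
```

The count per cell is what you would then encode with size or color, in place of one mark per item.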
Finally, take a look at the density mark type. It can make visual heat maps, either working with exact data points or grid packed points.

Highlight multiple row values based on another value in tableau

I have a table with 4 columns grouped by first 2 columns. Something like below
Company  Department  Department Rating  Org  Org Rating  Emp Type  Employee
A        Sales       1                  XX   2           External  John
         Ops         2                  XX   1           Hybrid    Mike
B        HR          1                  YY   2           Internal  Richard
         Dev         3                  ZZ   3           Internal  Julie
I want to highlight Department and Org based on the values in their rating columns (1 = yellow, 2 = red, 3 = blue).
Could anyone guide me on how to do this in Tableau?
By default, Tableau only highlights measure values, so to highlight dimensions you need to trick it a bit.
First, you'll want to create two dummy columns with the formula min(1). You can either create this through a calculated field or just double click and type them in the columns shelf.
In the Marks card for each of the two new measures (not the All card), place Department on the text field in one and Org in the text field of the other. Set the Mark type to be Text on both of them.
Create a calculated field to define the colors. The actual values don't matter here, as you'll assign a color to them later. Put this on the Color shelf of the All marks card.
At this point, you might have something like...
Now to finish it off you can do a few things:
Double click on the color squares in the legend to reassign them
Change formatting on the measure to get rid of the axis, zero lines, etc
Move the label from the bottom to the top by creating a dual axis.

How do you heatmap in Tableau a list of numbers based on another column's data

I have data from 48 countries that I am trying to visualize on a map. I want to display half the countries in one color and the other half in another color. This segmentation is based on another column, which holds the string value 'yes' or 'no'. I want to do this in Tableau.
Country  data     OFF
---------------------
US       100,000  yes
IN       200,000  yes
BR       300,000  no
MX       150,000  no
I want to plot US, IN in Blue and BR, MX in green. The shades of green and blue are dependent on the values of data.
Have you tried to drag the OFF field to Color in the sheet? This should do the trick
You can put either the measure data on the color shelf to color by value, or the dimension OFF on the color shelf to color by OFF -- i.e. to use color to encode a single field's value.
If you want to color by two dimension fields, you can get the shading effect you mention by putting both on the color shelf -- using the shift key to add the second one if it's not in a hierarchy with the first one.
If you want to color by both a dimension and a measure (as in your case), you have to go to a little more effort, assuming it's worth it. You can make a calculated field that uses both fields to map into a category (say, returning a string like "off yes, data high" or "off yes, data mid") and then edit the colors to map them as you like.
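The logic of such a calculated field can be sketched in Python. This is an illustration, not Tableau syntax. The threshold of 175,000, the two-level split into "high" and "low", and the color names are assumptions chosen for the sample rows in the question.

```python
HIGH_THRESHOLD = 175_000  # assumed cut-off between "low" and "high" data values


def color_category(off, data, threshold=HIGH_THRESHOLD):
    """Combine the OFF dimension and the data measure into one category string."""
    level = "high" if data >= threshold else "low"
    return f"off {off}, data {level}"


# Sample rows from the question: (Country, data, OFF).
rows = [
    ("US", 100_000, "yes"),
    ("IN", 200_000, "yes"),
    ("BR", 300_000, "no"),
    ("MX", 150_000, "no"),
]

# Hypothetical palette: blue shades for 'yes', green shades for 'no'.
palette = {
    "off yes, data low": "light blue",
    "off yes, data high": "dark blue",
    "off no, data low": "light green",
    "off no, data high": "dark green",
}

categories = {country: color_category(off, data) for country, data, off in rows}

for country, category in categories.items():
    print(country, category, palette[category])
```

Assigning a color to each category string is the manual step that corresponds to editing the colors in the legend.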

matlab create title for each boxplot

Here is a problem:
I have a boxplot but there is no easy way to name each of these small boxes.
Here is an example:
boxplot(rand(10,3))
draws 3 boxes, but they are labeled 1, 2, 3, and I need something more meaningful.
I have one idea for how to achieve this:
load carsmall
boxplot(MPG,Origin)
But this requires restructuring my data and creating additional columns with titles.
Does
boxplot(rand(10,3), 'labels', {'a','b','c'})
do what you need?