Microsoft MDX tutorial: source structure column not found - tsql

I am in the process of learning about Microsoft's Analysis Services for SQL Server 2012. Having gone through the basic AdventureWorks tutorial using the VisualStudio 2012 plugins, I am now trying my hand at executing these same procedures using the DMX Query language.
The tutorial I am refering to can be found here:
http://msdn.microsoft.com/en-us/library/4b634cc1-86dc-42ec-9804-a19292fe8448(v=sql.110)
The Database project I am using in the Analysis Server is that of the BasicDataMining project as can be found upon the completion of the BasicDataMining tutorial here:
http://msdn.microsoft.com/en-us/library/ms167167(v=sql.110).aspx
The first lesson in the DMX tutorial goes well having used the following query to create the Mining Structure:
CREATE MINING STRUCTURE [Bike Buyer]
(
[Customer Key] LONG KEY,
[Age]LONG DISCRETIZED(Automatic,10),
[Bike Buyer] LONG DISCRETE,
[Commute Distance] TEXT DISCRETE,
[Education] TEXT DISCRETE,
[Gender] TEXT DISCRETE,
[House Owner Flag] TEXT DISCRETE,
[Marital Status] TEXT DISCRETE,
[Number Cars Owned]LONG DISCRETE,
[Number Children At Home]LONG DISCRETE,
[Occupation] TEXT DISCRETE,
[Region] TEXT DISCRETE,
[Total Children]LONG DISCRETE,
[Yearly Income] DOUBLE CONTINUOUS
)
WITH HOLDOUT (30 PERCENT or 1000 CASES)
When I arrive at the second lesson however the suggested query does not work:
ALTER MINING STRUCTURE [Bike Buyer]
ADD MINING MODEL [Decision Tree]
(
CustomerKey,
[Age],
[Bike Buyer] PREDICT,
[Commute Distance],
[Education],
[Gender],
[House Owner Flag],
[Marital Status],
[Number Cars Owned],
[Number Children At Home],
[Occupation],
[Region],
[Total Children],
[Yearly Income]
) USING Microsoft_Decision_Trees
WITH DRILLTHROUGH
Providing the following error:
Error (Data mining): The 'CustomerKey' source structure column for the 'CustomerKey' mining model column was not found.
I am puzzled as how to solve this and why literal queries from Microsoft's own tutorial do not seem to work.
I would greatly appreciate any insights that can help me resolve this problem and proceed with this tutorial.

ALTER MINING STRUCTURE [Bike Buyer]
ADD MINING MODEL [Decision Tree]
(
**[Customer Key]**, <-- Missing space was the issue
[Age],
[Bike Buyer] PREDICT,
[Commute Distance],
[Education],
[Gender],
[House Owner Flag],
[Marital Status],
[Number Cars Owned],
[Number Children At Home],
[Occupation],
[Region],
[Total Children],
[Yearly Income]
) USING Microsoft_Decision_Trees
WITH DRILLTHROUGH

Related

Errors converting Geometry to Geography

I am getting an error trying to convert data from a Geometry field to a geography field in a separate table.
INSERT INTO PIGeoData
([ID], [geo_name], [geo_wkt] ,[port_geography_binary] )
SELECT [id], [name] ,[wkt], GEOGRAPHY::STGeomFromWKB(em_ports.geom.STAsBinary(),4326)
FROM [guest].[em_ports]
where ID < 4548 and ID not in (select ID from PIGeoData)
The error I get is this
Msg 6522, Level 16, State 1, Line 1
A .NET Framework error occurred during execution of user-defined routine or aggregate "geography":
Microsoft.SqlServer.Types.GLArgumentException: 24205: The specified input does not represent a valid geography instance because it exceeds a single hemisphere. Each geography instance must fit inside a single hemisphere. A common reason for this error is that a polygon has the wrong ring orientation. To create a larger than hemisphere geography instance, upgrade the version of SQL Server and change the database compatibility level to at least 110.
Microsoft.SqlServer.Types.GLArgumentException:
at Microsoft.SqlServer.Types.GLNativeMethods.ThrowExceptionForHr(GL_HResult errorCode)
at Microsoft.SqlServer.Types.GLNativeMethods.GeodeticIsValid(GeoData& g, Double eccentricity, Boolean forceKatmai)
at Microsoft.SqlServer.Types.SqlGeography.IsValidExpensive(Boolean forceKatmai)
at Microsoft.SqlServer.Types.SqlGeography..ctor(GeoData g, Int32 srid)
at Microsoft.SqlServer.Types.SqlGeography.GeographyFromBinary(OpenGisType type, SqlBytes wkbGeography, Int32 srid)
I get the same message if I try to convert from WKT using
,GEOGRAPHY::STGeomFromText(wkt,4326)
Both these formats come from the MS documentation here
But if I copy the polygon data from the wkt and paste it into a query like this
declare #sGeo geography
declare #sWKT varchar(max)
select #sWKT = wkt from guest.em_ports where wkt like '%POLYGON ((73.50667 4.181667,73.50667 4.21,73.48 4.21,73.48 4.1783333,73.50667 4.181667,73.50667 4.181667))%'
set #sGeo = geography::STPolyFromText (#sWKT, 4326 )
Update PIGeoData
Set PortBoundaries = #sGeo
Where wkt like '%POLYGON ((73.50667 4.181667,73.50667 4.21,73.48 4.21,73.48 4.1783333,73.50667 4.181667,73.50667 4.181667))%'
that works.
So I moved all the non-geo data to the new table and started going through record by record to see which WKT was failing:
I used this query
Update PIGeoData
Set port_geography_binary = GEOGRAPHY::STGeomFromText(geo_wkt,4326)
where port_geography_binary is null and ID = <xyz>
where xyz was individual record ids
These WKT values succeeded
POLYGON ((-135.31197 59.451653,-135.32457 59.45799,-135.32996 59.454834,-135.36717 59.455154,-135.36452 59.449005,-135.36488 59.43996,-135.36697 59.43817,-135.33139 59.438065,-135.31197 59.451653,-135.31197 59.451653))
POLYGON ((-4.524549 48.365623,-4.518855 48.361416,-4.4854136 48.367413,-4.436236 48.381382,-4.420772 48.39644,-4.431077 48.398525,-4.4376454 48.393867,-4.438626 48.38611,-4.4559207 48.390007,-4.470995 48.387226,-4.4933248 48.384468,-4.499816 48.38401,-4.512855 48.3754,-4.524549 48.365623,-4.524549 48.365623))
These WKT values failed
POLYGON ((-8.788489 37.773106,-8.989748 37.785244,-9.11148 37.93065,-9.01401 38.13953,-8.993956 38.30128,-9.266149 38.264282,-9.382366 38.33244,-9.435615 38.54836,-9.656681 38.602306,-9.683701 38.883057,-9.1720295 39.00796,-8.444215 39.550682,-8.213643 39.355015,-8.537656 38.037514,-8.712016 37.782127,-8.788489 37.773106))
POLYGON ((-119.71587 34.396824,-119.69837 34.410378,-119.67453 34.41837,-119.62994 34.420082,-119.63012 34.380177,-119.62986 34.3551,-119.71534 34.355022,-119.71587 34.396824,-119.71587 34.396824))
There is nothing obvious to me in the data. Can anyone help with why these records and data are failing?
TIA
The relevant part of the error message is "A common reason for this error is that a polygon has the wrong ring orientation."
The polygons that have failed are in clockwise order.
To convert them to counter-clockwise order, you can use something like this:
DECLARE #t VARCHAR(MAX)='POLYGON ((-119.71587 34.396824,-119.69837 34.410378,-119.67453 34.41837,-119.62994 34.420082,-119.63012 34.380177,-119.62986 34.3551,-119.71534 34.355022,-119.71587 34.396824,-119.71587 34.396824))'
DECLARE #x XML=REPLACE(REPLACE(REPLACE(#t,'POLYGON ((','<root><p>'),'))','</p></root>'),',','</p><p>')
DECLARE #r VARCHAR(MAX)='POLYGON (('+STUFF((
SELECT ','+q.Point
FROM (
SELECT n.value('.','varchar(50)') AS Point, ROW_NUMBER() OVER (ORDER BY t.n) AS Position
FROM #x.nodes('/root/p') t(n)
) q ORDER BY q.Position DESC
FOR XML PATH(''), TYPE
).value('.','VARCHAR(MAX)'),1,1,'')+'))'
DECLARE #g GEOGRAPHY=GEOGRAPHY::STGeomFromText(#r,4326)
SELECT #g, #g.ToString()
Later edit:
There is a convention that says that a polygon should always be represented in counter-clockwise order. Imagine that you have a polygon in the shape of the equator; without this convention it would not be clear if the polygon represents the northern hemisphere or the southern hemisphere. See Spatial Data Types Overview in the Microsoft SQL Server documentation for details.
Additionally, there is a limitation in SQL Server when the compatibility level is 100 or below that each geography instance must fit inside a single hemisphere. If you are using SQL Server 2012 or later and you choose to use at least compatibility level 110, you can avoid the error message, but the polygon would represent the entire area that is outside of what you would normally think that the polygon represents.
If you use compatibility level is 100 or below, you could use a TRY/CATCH to detect the error and if it happens you should try reversing the polygon.
If you use compatibility level 110 or later, you can try to use STArea() to check if the polygon has a surface which is much bigger or much smaller than one hemisphere. If the area approaches 510100000000000 square meters (which approximately the area of the entire earth) then you should reverse the polygon.

Problems with spatial join

I am new to sql, and attempting to use it to speed up spatial analysis on a set of ~1.2 million trips from a csv that contains the lat and lon for pickup and dropoff points.
What I am trying to do in plain English is:
select all trips that start in the area of interest (loaded into my database as a shapefile) into one table
select all trips that end in the area of interest into another
-perform a spatial join between these points and a shapefile of census tracks (which contains neighborhood names)
count by neighborhood name to list the most frequent origins/destination of trips to/ from the area of interest.
The code I am working with is below (If its helpful, NTA or neighborhood tabulation area, is the neighborhood name which I want to display in my table at the end of this operation) :
--Select all trips that end in project area
SELECT *
INTO end_PA
FROM trips, projarea
WHERE ST_Intersects(trips.dropoff, projarea.geom);
--for trips that end in project area - index by NTA of pick up point
ALTER TABLE end_PA ADD COLUMN GID SERIAL;
CREATE TABLE points_ct_end AS
SELECT nyct2010.ntacode as ct_nta, end_PA.gid as point_id
from nyct2010, end_PA WHERE ST_Intersects(nyct2010.geom , end_PA.pickup);
--Count most common NTA
--return count for each NAT as a csv
copy(
select count(ct_nta) from points_ct_end
group by ct_nta
order by count desc)
to 'C://TaxiData//Analysis//Trips_Arriving_LM.csv' DELIMITER ',' CSV HEADER;
However, I am having problems from the very start - ST_Intersects does not return any points within the area of interest!
Troubleshooting solutions I have tried thus far:
My first thought is that the points weren't in the correct SRID. When I created the 'dropoff' point I set the SRID to 4326. I tried both using ST_SetSRID to change the projection of both data sets to 4326, and manually re projecting the shapefiles to 4326 in ArcMap - but neither worked.
I plotted a small sample of the points from the 'trips' data set in Arc Map to ensure they were correctly projected and overlapping with the ProjArea shapefile. They are.
I imported the multipoint shapefile this created into my geo database to test if that worked with ST_Intersects. Nope.
I tried using ST_Within. This threw the error message:
ERROR: function st_within(character varying, geometry) does not exist
....
HINT: No function matches the given name and argument types. You
might need to add explicit type casts.
I am using Big SQL and postgres
Thanks!!
My first thought is that the points weren't in the correct SRID. When I created the 'dropoff' point I set the SRID to 4326. I tried both using ST_SetSRID to change the projection of both data sets to 4326, and manually re projecting the shapefiles to 4326 in ArcMap - but neither worked.
ST_SetSRID doesn't change the projection (reproject). It just changes the internal representation. This can totally screw everything up if the previous SRID matched the input data. You likely wanted ST_Transform().
There isn't enough information here to trouble shoot this problem. However, we can answer this...
ERROR: function st_within(character varying, geometry) does not exist
This simply means the first argument is not a geometery. Of course, we can't do anything with that at all because we don't have your query that you tried with ST_Within().
Your syntax for ST_Intersects() looks to be right. But, there simply isn't enough information provided to help. Show some schema and sample data.

postgresql phrase extraction & ranking

From selected rows in a table, how can one extract and rank phrases based on how often they occur?
example 1: http://developer.yahoo.com/search/content/V1/termExtraction.html
example 2: http://mirror.me/i/love
INPUT:
CREATE TABLE phrases (
id BIGSERIAL,
phrase VARCHAR(10000)
);
INSERT INTO phrases (phrase) VALUES (‘Italian sculptors and painters of the renaissance favored the Virgin Mary for inspiration.’)
INSERT INTO phrases (phrase) VALUES (‘Andrea Bolgi was an italian sculptor’)
DESIRED OUTPUT:
phrase | weight
italian sculptor | 5
virgin mary | 2
painters | 1
renaissance | 1
inspiration | 1
Andrea Bolgi | 1
To find just words, not phrases, one could use
SELECT * FROM ts_stat('SELECT to_tsvector(''simple'', phrase) FROM phrases')
ORDER BY nentry DESC, ndoc DESC, word;
Some notes:
phrases could contain “stop words”, e.g. “easy to answer”
ideally, english language variations and synonyms would be automatically grouped.
Could pg_trgm help? (it’s ok if only 2 and 3 word phrases are found). How exactly?
Related questions:
What techniques/tools are there for discovering common phrases in chunks of text?
How to find common phrases in a large body of text
How to extract common / significant phrases from a series of text entries
I agree with Craig that this is certainly way beyond the scope of what tsearch2 was intended to do as well as any other existing PostgreSQL tools. However, I do think that this might not be too bad to do in the db engine. One of the strengths of PostgreSQL is programmability and this strength gives you some very underutilized options.
As Craig notes, this is the domain of natural language processing, not of SQL per se, so the first thing you want to do is settle on a natural language processing toolkit with support for a stored procedure language that PostgreSQL supports. In other words, you want something that supports Perl, Python, C, etc. Whatever PostgreSQL supports and you feel comfortable working in.
The second step is to create functional interfaces for this toolkit in stored procedure languages. This should take text in, and output the phrase breakdown in some sort of type PostgreSQL can handle reasonably well. You need to pay attention to the type carefully because that affects things like GIN indexing.
From there you can incorporate it into your database interfaces and queries.

FileMaker database design with calculated fields and filtering

I am trying out Filemaker Pro 12 right now with no previous FM experience, although other basic DB experience. The issue I have is trying to do filtered queries for a report that span one-to-many relationships. Here is an example;
The 2 tables:
Sample_Replicate
PK
Sample FK
other fields
Weights
Sample_Replicate_FK (linked to PK of Sample_Replicate)
Weight
Measurement type (tare, gross, dry, ash)
Wash type (null or from list of lab assays)
I want to create a report that displays: (gross-tare), (dry-tare)/(gross-tare), (ash-tare)/(gross-tare), and (dry-tare)/(gross-tare) for all dry weights with non null wash types.
It seems that FM wants me to create columns for each of these values (which is doable as the list of lab assays changes minimally and updating the database would be acceptable, though not preferred). I have tried to add a gross wt, tare wt, etc to the Sample_Replicate table, but it only is returning the first record (tare wt) when I use calculated field and method:
tare wt field = Case ( Weights::Measurement type = "Tare"; Weights::Weights )
gross wt field = Case ( Weights::Measurement type = "Gross"; Weights::Weights )
etc...
It also seems to be failing when I add the criteria:
and Is Empty(Weights::Wash type )
Could someone point me in the right direction on this issue. Thanks
EDIT:
I came across this: http://www.filemakertoday.com/com/showthread.php/14084-Calculation-based-on-1-to-many-relationship
It seems that I can create ~15 calculated fields for each combination of measurement and wash type on the weights table, then do a sum of these columns in the sample_replicate after adding these 15 columns to the table. This seems absolutely asinine. Isn't there a better way to filter results of a one-to-many relationship in FM?
What about the following structure:
Replicate
ID
Wash Weight
Replicate ID
Type (null or from list of lab assays)
Tare
Gross
Dry
Ash
+ calculated fields
I assume you only calculate weight ratios of the same wash type. The weight types (tare, gross, etc.) are not just labels here; since you use them in formulas in specific places, they are more like roles, so I think they deserve their own fields.
add tare wt field, etc. in the Weights table but then add a calc field in your Sample_Replicate table to get the sum of all related values
ex: add field "total tare wt" to be "sum ( Weights::tare wt)"

Joing several MDX query results in a single report

I use MS SQL Server 2008 R2.
I've got the problem, please, excuse the long explanation.
We've got the SSAS cube. It is under development at this time, but it is partially working and can be accessed through excel.
There are projects: hierarchycal parent-child dimension
There are resources assigned to the project (e.g. man-hours, building materials, technic): dimension with resource types, fact M2M table ProjectId-ResourceId-UnitsCount-Cost
There are milestones for the projects: dimension with milestone types (few are defined), M2M fact table: ProjectId-MilestoneId-...milestone dates: planned/actual start/finish
This is a simplified schema.
I need to create a MS Reporting Services report with the following columns:
Project Hierarchy
several columnts with the pre-defined and "hardcoded" resource type amount. e.g the business wants to see the columnt with man-hours spent, and concrete consumption in cub-meters. Thse two clauses can be hardcoded in the query.
several columns with the pre-defined and "hardcoded" milestone type dates
this is a simplified schema too, more columns with other dimension slices are needed...
The problem is that i cannot find an elegant way to create this report.
in my current version, i have to create 2 datasets and query the resouce and milestone data in separate mdx queries.
then i need to use RS-lookup function to join the data in report outcome.
Please acvise:
is there a possibility to query this data in an single mdx query. when i try something like this:
union({{[Dim Resource].[Measure].[man-hour]} + {[Dim Resource].[Measure].[cub-meter]}},
{[Dim Milestone].[Milestone Type].[ProjectStart]}) i've got "different dimensionality" error. Any workarounds?
if i need to output a formatted value like: "X 'man-hour' / Y 'cub-meter'", i have to use lookup func to get both parts of the formula - any better way?
can i query this data any other way?
Please, indicate the direction of googling
or... should i just query the data from the source tables (this is allowed by security restrictions) with SQL
thank you in advance
Perhaps create a new 'virtual cube' to contain data from both of your existing cubes, then query that one.