CodeX: Get Table Query Specification

Get Table Query Specification

With a single source reference (e.g. URL pointing at an HTML page), it is possible for xSkrape to identify potentially multiple tables of data. Each function call, however, must deal with a single table. The table query specification is a way to identify the best match for a table using declarative syntax: you describe what it is you want, not necessarily how to get it (xSkrape handles that).

Each query specification is composed of a criteria string, followed by an equal sign, followed by some text that is specific to the criteria being applied. Any of the criteria listed below can be used, where if more than one is required, they can be combined using the "&&" symbols. (Those familiar with C# will recognize that as a "logical and" - which is what combining the terms effectively is.) Because some kinds of change in the data source could invalidate use of specific filters, combining more than one can be considered beneficial, even if it seems redundant. Ultimately, if there are multiple tables left matching your criteria, the first one is selected.

Criteria	Overview
tablenumber=n	n must be an integer greater than zero and represents the position-based index of the table, in the source. This can be observed when using Page Explorer, based on the "tab number" shown for "potential table matches." This is a simple way to identify a table of interest, but it could become incorrect if changes in the source structure changes the position of candidate tables.
columnname=textualname	textualname must be a column name in the table of interest. Page Explorer shows resolved column names for tables and makes a suggestion to use this criteria when it would uniquely identify the table in the source. A change in the column name in the source data could cause this filter to no longer operate as expected.
columnnameregex=regular_expression	When regular_expression is applied as a regular expression over all column names in tables in the source data, only tables where there is a match are considered. This filter allows you to support "non-exact" matches, offering some measure of flexibility in the event you expect the source column names might change in a well-defined way.
isnumericonly=true\|false	If this is set to "true", requires that at least one column in the candidate table be a numeric type (integer or floating-point).
isallowsblank=true\|false	This filter is applied as part of a series of column-level filters, where there must be at least one column that meets the criteria provided for the table to be considered a match. In this case, "false" causes columns that contain any "blank data" to be removed from consideration.
valueregex=regular_expression	This filter is applied as part of a series of column-level filters, where there must be at least one column that meets the criteria provided for the table to be considered a match. In this case, the regular expression provided must match for all rows for the column being considered. For example, "\d*" as a regular expression requires that at least one column in the candidate table must be blank or have only digits.
musthaverowfilter=filter_expression	When specified, the filter expression must match at least one row of data in the candidate table, or the table is excluded as a candidate. The expression format is effectively the same as used in function parameters such as "tableValueFilter" and is documented thoroughly here.
musthaveandrowfilter=filter_expression	Similar to musthaverowfilter, except the row filter is also applied to the actual returned table. This functionality is largely superceded by the "tableValueFilter" parameter on some functions, but it does serve as an option for when this is not available. (This serves as both a filter and a transformation, described below.)
jsonroot=fieldname	Locates the field name in the text. The format should be fieldname: {...} which {...} taken together comprises a valid JSON object. The object text is returns for subsequent processing.
rangerowstart=int	Optionally provide a parsing hint for what row number to start looking for tabular data.
rangerowend=int	Optionally provide a parsing hint for what row number to stop looking for tabular data.
rangecolumnstart=int	Optionally provide a parsing hint for what column number to start looking for tabular data.
rangecolumnend=int	Optionally provide a parsing hint for what column number to stop looking for tabular data.

The table query string can serve a second purpose: it can apply some types of transformations and transformation hints over the matched source data table, once established. For example, xSkrape's logic that tries to identify tabular data may have some problems finding the expected column names in the source data, but a hints such as "AllowedBlankHeaderColumns" or "SkipRowCount" may help find what you expect. Expect to see more options in the future, although our goal is to make the product intelligent enough to avoid needing to use hints as much as possible.

Transform / Hint	Overview
rowfilter=filter_expression	Similar to musthaverowfilter, except the row filter is only applied to the actual returned table and does not participate in the inclusion / exclusion decisions about the candidate table (unless the expression is invalid for a table, in which case the table is not considered a candidate).
allowedblankheadercolumns=ColumnName1 [, ColumnName2 ... , ColumnNameN]	Normal parsing logic assumes column headings will exist for all tabular columns, and as such, any "gap" in column naming causes the entire row to be skipped as a column heading candidate. However, this option lets you specify one or more column names that you would prefer to use in place of any blanks encountered. For example, a row with three blanks could have those blanks replaced with "Col1", "Col2" and "Col3" if you used: "AllowedBlankHeaderColumns=Col1,Col2,Col3".
skiprowcount=n	n is an integer greater than zero that specifies a fixed number of rows that you wish to ignore at the start of the source data. (Normal parsing logic can usually determine the proper value to use by default, but this option precedes these intelligent parsing rules.)
mergecolumndown=n	n is an integer greater than zero that specifies the maximum number of header columns that may be carried down and concatenated with the following row to form the "actual header" row. This is useful for source data that looks like this for example: Here with "MergeColumnDown=1", we get column names of "Group A - Price", "Group A - Quantity", "Group B - Price", "Group B - Quantity", etc.
onlycolumns=cn1[,cn2,...]	Optionally restrict the resulting table to a subset of columns (listed in a comma-separated list).
excludecolumns=cn1[,cn2,...]	Optionally exclude specific columns from the resulting table (listed in a comma-separated list).
copywhenblank=cn1[,cn2,...]	Optionally list one or more columns where is the value is found to be blank, populate it with the last non-blank value found for it. This can be used to normalize grids that have suppressed repeating values.

Too little detail	Too confusing / unclear	Needs better visuals
Spelling / grammar issues	Bugs / broken links	Needs to be more prominent!
Looks good to me!	Page looks (other):