Function: WebGetTable

Extract tabular data (rows and columns) from a source reference, using query conditions to identify the characteristics of the data in the event there are multiple tables that could be identified in the source.

Usage

  • Excel (Online)
  • Excel (Desktop)
    =WebGetTable(string url
    , string tableCriteriaQuery
    , bool includeHeaders = false
    , string headerRenaming = ""
    , string additionalConfig = ""
    , object noDataValue = null
    , string tableValueFilter = ""
    , string tableValueSort = "")
  • SQL Server
    WebGetTable(@serviceUrl nvarchar(1000)
    , @parseUrl nvarchar(MAX)
    , @tableCriteriaQuery nvarchar(MAX)
    , @additionalConfig nvarchar(2000))

Parameters

  • Excel
    • Target
      Names the specific cell that will be the top-left anchor of the retrieved table, or the target can be a cell range that defines the maximum extent of the returned table. Only applies to Excel "Web" Add-in.
    • url
      A reference to the source data to be queried. Can be a reference that is HTTP, HTTPS or FILE based. Can optionally be an expression that supports loading data across multiple requests (or performs other special parsing), as described here. Required. Excel "Web" settings.
    • tableCriteriaQuery
      A query specification to isolate a specific table from the raw source data. Details about the query syntax are covered here. If there are multiple tables that meet the given criteria, the first table is used. Optional. If this is omitted, uses "tableindex=1" as the default value. Excel "Web" settings.
    • includeHeaders
      If True, table header names are included in the returned data feed. Optional - default behavior is no table headers (just raw contents).
    • headerRenaming
      An optional set of column renaming expressions of the format "OldColumnName becomes NewColumnName", delimited by "&&" if there are multiple. (See examples.) An empty string or omitted implies header names will be as they are in the data source.
    • additionalConfig
      An optional list of configuration parameters that control how the query is issued. Details about the configuration parameters syntax are covered here. Optional - provide an empty string or omit the parameter if there are no configuration parameters needed (i.e. default behavior).
    • noDataValue
      When no data value is retrieved, this value is returned as a "default" for all fields in the range. This value also plays a role depending on your "ErrorHandling" configuration setting. Optional - no value is treated as an empty string.
    • tableValueFilter
      A filter expression that can be applied to the table to help isolate a candidate row or rows of interest. The format for the expression is more or less the same as described here. Optional. Excel "Web" settings.
    • tableValueSort
      A string that contains the column name followed by "ASC" (ascending) or "DESC" (descending). Columns are sorted ascending by default. Multiple columns can be separated by commas. Optional.
  • SQL Server
    • @serviceUrl
      A URL reference to your SQL-Hero application server services. Typically this would be "http://servername:46837" where "servername" matches the server on which you installed the SQL-Hero services, with port 46837 being the default port number used by the SQL-Hero installer program.
    • @parseUrl
      A reference to the source data to be queried. Can be a reference that is HTTP, HTTPS or FILE based. Can optionally be an expression that supports loading data across multiple requests, as described here.
    • @tableCriteriaQuery
      A query specification to isolate a specific table from the raw source data. Details about the query syntax are covered here. If there are multiple tables that meet the given criteria, the first table is used.
    • @additionalConfig
      An optional list of configuration parameters that control how the query is issued. Details about the configuration parameters syntax are covered here. Provide an empty string or NULL if there are no configuration parameters needed (i.e. default behavior).

Return Value(s)

  • Excel
    • Returns values of an appropriate type. For example, a value that looks like a number will be returned as a numeric (i.e. can be formatted as a number). Populates a range of cells. Can be the same size as the source data, or a subset, or a superset (if a subset, takes from the top left-most fields; if a superset, additional rows and columns populate based on the noDataValue parameter).
    • Values that can't be parsed to a native type are left as strings.
    • No data matching the query expression returns the value specified in the noDataValue parameter (or an empty string if no value is provided) for all cells of the range.
    • Error values can be returned as well, depending on your "ErrorHandling" configuration setting.
  • SQL Server
    • As a table-valued UDF, returns a fixed table of the format:
      (RowNumber int
      , ColumnNumber int
      , Value nvarchar(MAX))
    • RowNumber can be used to group related columns. The implied primary key is (RowNumber, ColumnNumber).
    • No data matching the query expression implies no rows will be returned from the UDF.
    • Underlying errors are returned as exceptions by default. Other options are available depending on your "ErrorHandling" configuration setting.

Remarks

This function is typically used to extract tabular data from a data source that either is tabular by nature (e.g. CSV, Excel), or tabular details can be inferred (e.g. HTML).

When used for Excel Desktop, data is returned to populate a cell range that you would typically define by selecting multiple rows and cells, enter the function, and hit Ctrl-Shift-Enter. With Excel "Web", the target cell reference can be a single cell in which case the imported table is top-left anchored at that cell, or a range can be specified as the maximum bounds for the imported table. The order of the data is determined solely by the order in the source, unless the tableValueSort parameter is used. For the Excel "Web" add-in, functionality described here for the Desktop add-in can be carried over based on the parameter names mapping to the fields made available in the data connection in the add-in task pane.

When used for SQL Server, data is returned in a fixed tabular format that includes row and column numbers along with the unaltered text data corresponding to the data source. We recommend using the Script Generator or an applicable SQL-Hero template to create a wrapper when calling WebGetTable. The generated wrapper can be an inline UDF or a view that includes columns that have names corresponding to the columns of the data source, rather than row and column numbers as are returned by WebGetTable. (However, there is no reason you can't use WebGetTable directly.) The process of creating wrappers is covered in more detail in the xSkrape fundamentals white-paper.

Examples

  • Excel

    1. Pull one of several tables from a HTML page, overriding default timeout and throttling settings, and renaming columns

    =WebGetTable("http://finance.yahoo.com/q/ae?s=GS+Analyst+Estimates", "columnname=RevenueEst", TRUE, "CurrentQtrSep15 becomes CurrentQtr && NextQtrDec15 becomes NextQtr && CurrentYearDec15 becomes CurrentYr && NextYearDec16 becomes NextYr", "Timeout=200;MaxRequestCount=2000;MaxRequestIntervalSizeSeconds=300;MaxRequestWaitSeconds=120;MinRequestIntervalMilliseconds=10")

    This URL has a number of candidate tables we could get data from. In this example we're picking the one that has a column called "RevenueEst" as shown when using Page Explorer. (In fact, Page Explorer offers this as a suggestion.) Note that the name may not match the exact text on the page - whitespace can be removed. The "TRUE" here indicates we want to include column names as determined from the source, and our cell range selected accounts for this extra row. The next parameter is composed of four column renames where we've effectively stripped out year references. The final parameter overrides the default 90 second request timeout with 200 seconds, and establishes non-default throttling settings which among other things sets the minimum interval between requests to be 10 milliseconds instead of the default of 25.

    2. Pull data from a tab-separated value (TSV) file located on OneDrive

    =WebGetTable("https://codexenterprisesllc-my.sharepoint.com/personal/sqlheroguy_codexenterprisesllc_onmicrosoft_com/_layouts/15/guestaccess.aspx?guestaccesstoken=ykGxUq9k2htUzjAPlJWCX%2fVPoozyAkPvT70Qk5bpcrc%3d&docid=0a99014156eb240c2be82c206e20e2578", "tableindex=1", FALSE)

    This example assumes we've put a tab-separted file on OneDrive and created a sharing link for it. (This is covered in more detail in the Pulling a cloud-based CSV file into Excel example.) The table reference "tableindex=1" is consistent with the fact there's only one table available: formats such as CSV and TSV are tabular by nature. After selecting a cell range that corresponds to the raw data and hitting Ctrl-Shift-Enter, data fills the range such as illustrated here (A3:C6), where we've added our own column headings in row 2:

    3. Pull data from a paged grid on an HTML page, with pre-sorting and pre-filtering

    =WebGetTable("{ ""url"":""https://www.codexframework.com/xskrape/sampledata1?page={0}"", ""method"":""step_sequential"", ""stop_on"":""any"", ""arguments"":[{""start"":1, ""step"":1}] }", "tableindex=1", TRUE)

    With the function call illustrated here, we can pull multiple pages of data from a paged grid into a single Excel cell range:

    By defining a fixed cell range with 11 rows and including column headers in the output, we're effectively taking the first ten rows of data from the source. We've added filters on column headings, which if we were to filter to say "Electronics" only, returns a subset of that first ten:

    If we instead intended to filter over all source data, we could do that using a filter expression in our function call:

    =WebGetTable("{ ""url"":""https://www.codexframework.com/xskrape/sampledata1?page={0}"", ""method"":""step_sequential"", ""stop_on"":""any"", ""arguments"":[{""start"":1, ""step"":1}] }", "tableindex=1", TRUE, "", "", "", "Category='Electronics'")

    We can go a step further and add a sort as well, creating a "top 10" based on Quantity, filtered to two different categories (illustrating the fact that the filter expression can include multiple conditions):

    =WebGetTable("{ ""url"":""https://www.codexframework.com/xskrape/sampledata1?page={0}"", ""method"":""step_sequential"", ""stop_on"":""any"", ""arguments"":[{""start"":1, ""step"":1}] }", "tableindex=1", TRUE, "", "", "", "Category='Electronics' OR Category='Clothing'", "Quantity DESC")

    Our cell range is now full, with the sort eliminating rows that have a lower quantity than the "top 10":

    4. Error handling scenarios

    =WebGetTable("{ ""url"":""https://www.codexframework.com/xskrape/sampledata1?page={0}"", ""method"":""step_sequential"", ""stop_on"":""any"", ""arguments"":[{""start"":1, ""step"":1}] }", "tableindex=1", TRUE, "", "ErrorHandling=ReturnValue", "N/A", "aaa")

    In this example, we've purposely induced an error: the filter expression of "aaa" is not valid (a syntax error, effectively). Default error handling causes every cell in the range to show up as "#VALUE!". In the above example we've changed it to "ReturnValue" which now fills every cell with an error message, namely "xSkrape: Cannot find column [aaa]." If we were to use this instead:

    =WebGetTable("{ ""url"":""https://www.codexframework.com/xskrape/sampledata1?page={0}"", ""method"":""step_sequential"", ""stop_on"":""any"", ""arguments"":[{""start"":1, ""step"":1}] }", "tableindex=1", TRUE, "", "ErrorHandling=NoData", "N/A", "aaa")

    Now every cell is filled with "N/A" which is our "no data" value specified in the sixth parameter to the function.

  • SQL Server

    1. Compute a scalar aggregate using one of several tables from a HTML page, overriding default timeout and throttling settings, using raw column and row numbers

    SELECT AVG(CONVERT(decimal(16,0), REPLACE([Value], 'B', '') * 1000000000.0))
    FROM dbo.WebGetTable('http://localhost'
        , 'http://finance.yahoo.com/q/ae?s=GS+Analyst+Estimates'
        , 'columnname=RevenueEst'
        , 'Timeout=200;MaxRequestCount=2000;MaxRequestIntervalSizeSeconds=300;MaxRequestWaitSeconds=120;MinRequestIntervalMilliseconds=10')
    WHERE
        ColumnNumber = 3
    AND RowNumber IN (2, 3)
    AND CHARINDEX('B', [Value]) > 0

    This URL has a number of candidate tables we could get data from. In this example we're picking the one that has a column called "RevenueEst" as shown when using Page Explorer. (In fact, Page Explorer offers this as a suggestion.) Note that the name may not match the exact text on the page - whitespace can be removed. The final parameter overrides the default 90 second request timeout with 200 seconds, and establishes non-default throttling settings which among other things sets the minimum interval between requests to be 10 milliseconds instead of the default of 25.

    The WHERE clause is filtering to a specific column, by ordinal position (fourth column since the number is zero-based). It's also filtering by row number (again, zero-based), and doing a basic data validation (ensuring the data value contains the character 'B'). The SELECT clause returns an average from the two rows that are filtered against, adjusting the value to reflect it's expressed in billions.

    2. Pull data from a tab-separated value (TSV) file located on OneDrive, using a wrapper view created using Script Generator

    CREATE VIEW [dbo].[TSV_Example]
    AS
    SELECT
        MAX(CONVERT(int, (CASE WHEN ColumnNumber=0 AND ISNUMERIC([Value])=1 THEN [Value] ELSE NULL END))) As [ID]
        , MAX(CONVERT(float, (CASE WHEN ColumnNumber=1 AND ISNUMERIC([Value])=1 THEN [Value] ELSE NULL END))) As [Price]
        , MAX(CASE WHEN ColumnNumber=2 THEN [Value] ELSE NULL END) As [Name]
    FROM
      [dbo].WebGetTable('http://localhost'
        , 'https://codexenterprisesllc-my.sharepoint.com/personal/sqlheroguy_codexenterprisesllc_onmicrosoft_com/_layouts/15/guestaccess.aspx?guestaccesstoken=ykGxUq9k2htUzjAPlJWCX%2fVPoozyAkPvT70Qk5bpcrc%3d&docid=0a99014156eb240c2be82c206e20e2578'
        , 'tableindex=1'
        , '')
    GROUP BY
        RowNumber;

    The above code was generated from the standalone Script Generator tool, using the following parameters entered after clicking on the "Create Script" button:


    3. Pull data from a paged grid on an HTML page, using a wrapper view created using SQL-Hero

    SELECT * FROM dbo.WebGetTable('http://localhost', '{ "url":"https://www.codexframework.com/xskrape/sampledata1?page={0}", "method":"step_sequential", "stop_on":"any", "arguments":[{"start":1, "step":1}] }', 'tableindex=1', NULL)

    If we assume that the above SELECT gives us the raw data we're interested in, how can we create a wrapper for it using SQL-Hero? Within the SQL Editor tool, we can go to the Template Manager (under the Templates menu option):

    Opening the template, we can use the "Save and Run Template" menu option:

    We're now prompted for template parameters:

    We can enter parameters that correspond closely to the parameters we used above, for the raw call to WebGetTable. After we've provided them and clicked "Ok", the template engine creates our view in a SQL editor window, ready to execute.

    4. Error handling scenarios

    There are only two supported error handling strategies when dealing with WebGetTable for SQL: Exception and ExceptionRelaxed. "ExceptionRelaxed" is the default and implies a hard SQL error (severity 16) is raised for what can be considered "severe" errors, and other situations which are potentially transient (e.g. such as a 503 error from the data source) simply return no data. (In turn offering application level error handling when there are no rows returned, when rows might otherwise be expected.) The alternative setting of "Exception" causes the SQL severity 16 error to be raised in most cases that are considered an error. Switching the error handling mode is possible using the fourth parameter to WebGetTable:

    SELECT * FROM [dbo].WebGetTable('http://localhost', 'http://wowowowow', 'tableindex=1', 'ErrorHandling=ExceptionRelaxed')

Related