Function: WebGetTable

Extract tabular data (rows and columns) from a source reference, using query conditions to identify the characteristics of the data in the event there are multiple tables that could be identified in the source.

Usage

  • SQL Server
    WebGetTable(@serviceUrl nvarchar(1000)
    , @parseUrl nvarchar(MAX)
    , @tableCriteriaQuery nvarchar(MAX)
    , @additionalConfig nvarchar(2000))

Parameters

  • SQL Server
    • @serviceUrl
      A URL reference to your SQL-Hero application server services. Typically this would be "http://servername:46837" where "servername" matches the server on which you installed the SQL-Hero services, with port 46837 being the default port number used by the SQL-Hero installer program.
    • @parseUrl
      A reference to the source data to be queried. Can be a reference that is HTTP, HTTPS or FILE based. Can optionally be an expression that supports loading data across multiple requests, as described here.
    • @tableCriteriaQuery
      A query specification to isolate a specific table from the raw source data. Details about the query syntax are covered here. If there are multiple tables that meet the given criteria, the first table is used.
    • @additionalConfig
      An optional list of configuration parameters that control how the query is issued. Details about the configuration parameters syntax are covered here. Provide an empty string or NULL if there are no configuration parameters needed (i.e. default behavior).

Return Value(s)

  • SQL Server
    • As a table-valued UDF, returns a fixed table of the format:
      (RowNumber int
      , ColumnNumber int
      , Value nvarchar(MAX))
    • RowNumber can be used to group related columns. The implied primary key is (RowNumber, ColumnNumber).
    • No data matching the query expression implies no rows will be returned from the UDF.
    • Underlying errors are returned as exceptions by default. Other options are available depending on your "ErrorHandling" configuration setting.

Remarks

This function is typically used to extract tabular data from a data source that either is tabular by nature (e.g. CSV, Excel), or tabular details can be inferred (e.g. HTML).

When used for SQL Server, data is returned in a fixed tabular format that includes row and column numbers along with the unaltered text data corresponding to the data source. We recommend using the Script Generator or an applicable SQL-Hero template to create a wrapper when calling WebGetTable. The generated wrapper can be an inline UDF or a view that includes columns that have names corresponding to the columns of the data source, rather than row and column numbers as are returned by WebGetTable. (However, there is no reason you can't use WebGetTable directly.) The process of creating wrappers is covered in more detail in the xSkrape fundamentals white-paper.

Examples

  • SQL Server

    1. Compute a scalar aggregate using one of several tables from a HTML page, overriding default timeout and throttling settings, using raw column and row numbers

    SELECT AVG(CONVERT(decimal(16,0), REPLACE([Value], 'B', '') * 1000000000.0))
    FROM dbo.WebGetTable('http://localhost'
        , 'http://finance.yahoo.com/q/ae?s=GS+Analyst+Estimates'
        , 'columnname=RevenueEst'
        , 'Timeout=200;MaxRequestCount=2000;MaxRequestIntervalSizeSeconds=300;MaxRequestWaitSeconds=120;MinRequestIntervalMilliseconds=10')
    WHERE
        ColumnNumber = 3
    AND RowNumber IN (2, 3)
    AND CHARINDEX('B', [Value]) > 0

    This URL has a number of candidate tables we could get data from. In this example we're picking the one that has a column called "RevenueEst" as shown when using Page Explorer. (In fact, Page Explorer offers this as a suggestion.) Note that the name may not match the exact text on the page - whitespace can be removed. The final parameter overrides the default 90 second request timeout with 200 seconds, and establishes non-default throttling settings which among other things sets the minimum interval between requests to be 10 milliseconds instead of the default of 25.

    The WHERE clause is filtering to a specific column, by ordinal position (fourth column since the number is zero-based). It's also filtering by row number (again, zero-based), and doing a basic data validation (ensuring the data value contains the character 'B'). The SELECT clause returns an average from the two rows that are filtered against, adjusting the value to reflect it's expressed in billions.

    2. Pull data from a tab-separated value (TSV) file located on OneDrive, using a wrapper view created using Script Generator

    CREATE VIEW [dbo].[TSV_Example]
    AS
    SELECT
        MAX(CONVERT(int, (CASE WHEN ColumnNumber=0 AND ISNUMERIC([Value])=1 THEN [Value] ELSE NULL END))) As [ID]
        , MAX(CONVERT(float, (CASE WHEN ColumnNumber=1 AND ISNUMERIC([Value])=1 THEN [Value] ELSE NULL END))) As [Price]
        , MAX(CASE WHEN ColumnNumber=2 THEN [Value] ELSE NULL END) As [Name]
    FROM
      [dbo].WebGetTable('http://localhost'
        , 'https://codexenterprisesllc-my.sharepoint.com/personal/sqlheroguy_codexenterprisesllc_onmicrosoft_com/_layouts/15/guestaccess.aspx?guestaccesstoken=ykGxUq9k2htUzjAPlJWCX%2fVPoozyAkPvT70Qk5bpcrc%3d&docid=0a99014156eb240c2be82c206e20e2578'
        , 'tableindex=1'
        , '')
    GROUP BY
        RowNumber;

    The above code was generated from the standalone Script Generator tool, using the following parameters entered after clicking on the "Create Script" button:


    3. Pull data from a paged grid on an HTML page, using a wrapper view created using SQL-Hero

    SELECT * FROM dbo.WebGetTable('http://localhost', '{ "url":"https://www.codexframework.com/xskrape/sampledata1?page={0}", "method":"step_sequential", "stop_on":"any", "arguments":[{"start":1, "step":1}] }', 'tableindex=1', NULL)

    If we assume that the above SELECT gives us the raw data we're interested in, how can we create a wrapper for it using SQL-Hero? Within the SQL Editor tool, we can go to the Template Manager (under the Templates menu option):

    Opening the template, we can use the "Save and Run Template" menu option:

    We're now prompted for template parameters:

    We can enter parameters that correspond closely to the parameters we used above, for the raw call to WebGetTable. After we've provided them and clicked "Ok", the template engine creates our view in a SQL editor window, ready to execute.

    4. Error handling scenarios

    There are only two supported error handling strategies when dealing with WebGetTable for SQL: Exception and ExceptionRelaxed. "ExceptionRelaxed" is the default and implies a hard SQL error (severity 16) is raised for what can be considered "severe" errors, and other situations which are potentially transient (e.g. such as a 503 error from the data source) simply return no data. (In turn offering application level error handling when there are no rows returned, when rows might otherwise be expected.) The alternative setting of "Exception" causes the SQL severity 16 error to be raised in most cases that are considered an error. Switching the error handling mode is possible using the fourth parameter to WebGetTable:

    SELECT * FROM [dbo].WebGetTable('http://localhost', 'http://wowowowow', 'tableindex=1', 'ErrorHandling=ExceptionRelaxed')

Related