Function: WebGetTable

Extract tabular data (rows and columns) from a source reference, using query conditions to identify the characteristics of the data in the event there are multiple tables that could be identified in the source.

Usage

  • Excel (Online)

Parameters

  • Excel
    • Target
      Names the specific cell that will be the top-left anchor of the retrieved table, or the target can be a cell range that defines the maximum extent of the returned table. Only applies to Excel "Web" Add-in.
    • url
      A reference to the source data to be queried. Can be a reference that is HTTP, HTTPS or FILE based. Can optionally be an expression that supports loading data across multiple requests (or performs other special parsing), as described here. Required. Excel "Web" settings.
    • tableCriteriaQuery
      A query specification to isolate a specific table from the raw source data. Details about the query syntax are covered here. If there are multiple tables that meet the given criteria, the first table is used. Optional. If this is omitted, uses "tableindex=1" as the default value. Excel "Web" settings.
    • includeHeaders
      If True, table header names are included in the returned data feed. Optional - default behavior is no table headers (just raw contents).
    • headerRenaming
      An optional set of column renaming expressions of the format "OldColumnName becomes NewColumnName", delimited by "&&" if there are multiple. (See examples.) An empty string or omitted implies header names will be as they are in the data source.
    • tableValueFilter
      A filter expression that can be applied to the table to help isolate a candidate row or rows of interest. The format for the expression is more or less the same as described here. Optional. Excel "Web" settings.
    • tableValueSort
      A string that contains the column name followed by "ASC" (ascending) or "DESC" (descending). Columns are sorted ascending by default. Multiple columns can be separated by commas. Optional.

Return Value(s)

  • Excel
    • Returns values of an appropriate type. For example, a value that looks like a number will be returned as a numeric (i.e. can be formatted as a number). Populates a range of cells. Can be the same size as the source data, or a subset, or a superset (if a subset, takes from the top left-most fields; if a superset, additional rows and columns populate based on the noDataValue parameter).
    • Values that can't be parsed to a native type are left as strings.

Remarks

This function is typically used to extract tabular data from a data source that either is tabular by nature (e.g. CSV, Excel), or tabular details can be inferred (e.g. HTML).

When used for Excel Desktop, data is returned to populate a cell range that you would typically define by selecting multiple rows and cells, enter the function, and hit Ctrl-Shift-Enter. With Excel "Web", the target cell reference can be a single cell in which case the imported table is top-left anchored at that cell, or a range can be specified as the maximum bounds for the imported table. The order of the data is determined solely by the order in the source, unless the tableValueSort parameter is used. For the Excel "Web" add-in, functionality described here for the Desktop add-in can be carried over based on the parameter names mapping to the fields made available in the data connection in the add-in task pane.

Examples

  • Excel

    1. Pull one of several tables from a HTML page, overriding default timeout and throttling settings, and renaming columns

    =WebGetTable("http://finance.yahoo.com/q/ae?s=GS+Analyst+Estimates", "columnname=RevenueEst", TRUE, "CurrentQtrSep15 becomes CurrentQtr && NextQtrDec15 becomes NextQtr && CurrentYearDec15 becomes CurrentYr && NextYearDec16 becomes NextYr", "Timeout=200;MaxRequestCount=2000;MaxRequestIntervalSizeSeconds=300;MaxRequestWaitSeconds=120;MinRequestIntervalMilliseconds=10")

    This URL has a number of candidate tables we could get data from. In this example we're picking the one that has a column called "RevenueEst" as shown when using Page Explorer. (In fact, Page Explorer offers this as a suggestion.) Note that the name may not match the exact text on the page - whitespace can be removed. The "TRUE" here indicates we want to include column names as determined from the source, and our cell range selected accounts for this extra row. The next parameter is composed of four column renames where we've effectively stripped out year references. The final parameter overrides the default 90 second request timeout with 200 seconds, and establishes non-default throttling settings which among other things sets the minimum interval between requests to be 10 milliseconds instead of the default of 25.

    2. Pull data from a tab-separated value (TSV) file located on OneDrive

    =WebGetTable("https://codexenterprisesllc-my.sharepoint.com/personal/sqlheroguy_codexenterprisesllc_onmicrosoft_com/_layouts/15/guestaccess.aspx?guestaccesstoken=ykGxUq9k2htUzjAPlJWCX%2fVPoozyAkPvT70Qk5bpcrc%3d&docid=0a99014156eb240c2be82c206e20e2578", "tableindex=1", FALSE)

    This example assumes we've put a tab-separted file on OneDrive and created a sharing link for it. (This is covered in more detail in the Pulling a cloud-based CSV file into Excel example.) The table reference "tableindex=1" is consistent with the fact there's only one table available: formats such as CSV and TSV are tabular by nature. After selecting a cell range that corresponds to the raw data and hitting Ctrl-Shift-Enter, data fills the range such as illustrated here (A3:C6), where we've added our own column headings in row 2:

    3. Pull data from a paged grid on an HTML page, with pre-sorting and pre-filtering

    =WebGetTable("{ ""url"":""https://www.codexframework.com/xskrape/sampledata1?page={0}"", ""method"":""step_sequential"", ""stop_on"":""any"", ""arguments"":[{""start"":1, ""step"":1}] }", "tableindex=1", TRUE)

    With the function call illustrated here, we can pull multiple pages of data from a paged grid into a single Excel cell range:

    By defining a fixed cell range with 11 rows and including column headers in the output, we're effectively taking the first ten rows of data from the source. We've added filters on column headings, which if we were to filter to say "Electronics" only, returns a subset of that first ten:

    If we instead intended to filter over all source data, we could do that using a filter expression in our function call:

    =WebGetTable("{ ""url"":""https://www.codexframework.com/xskrape/sampledata1?page={0}"", ""method"":""step_sequential"", ""stop_on"":""any"", ""arguments"":[{""start"":1, ""step"":1}] }", "tableindex=1", TRUE, "", "", "", "Category='Electronics'")

    We can go a step further and add a sort as well, creating a "top 10" based on Quantity, filtered to two different categories (illustrating the fact that the filter expression can include multiple conditions):

    =WebGetTable("{ ""url"":""https://www.codexframework.com/xskrape/sampledata1?page={0}"", ""method"":""step_sequential"", ""stop_on"":""any"", ""arguments"":[{""start"":1, ""step"":1}] }", "tableindex=1", TRUE, "", "", "", "Category='Electronics' OR Category='Clothing'", "Quantity DESC")

    Our cell range is now full, with the sort eliminating rows that have a lower quantity than the "top 10":

    4. Error handling scenarios

    =WebGetTable("{ ""url"":""https://www.codexframework.com/xskrape/sampledata1?page={0}"", ""method"":""step_sequential"", ""stop_on"":""any"", ""arguments"":[{""start"":1, ""step"":1}] }", "tableindex=1", TRUE, "", "ErrorHandling=ReturnValue", "N/A", "aaa")

    In this example, we've purposely induced an error: the filter expression of "aaa" is not valid (a syntax error, effectively). Default error handling causes every cell in the range to show up as "#VALUE!". In the above example we've changed it to "ReturnValue" which now fills every cell with an error message, namely "xSkrape: Cannot find column [aaa]." If we were to use this instead:

    =WebGetTable("{ ""url"":""https://www.codexframework.com/xskrape/sampledata1?page={0}"", ""method"":""step_sequential"", ""stop_on"":""any"", ""arguments"":[{""start"":1, ""step"":1}] }", "tableindex=1", TRUE, "", "ErrorHandling=NoData", "N/A", "aaa")

    Now every cell is filled with "N/A" which is our "no data" value specified in the sixth parameter to the function.

Related