CodeX: URL Expression Specification

There are two main formats for source data references:

A valid URL.
A data reference expression.

Regular URL references are those starting with "HTTP://", "HTTPS://" or "FILE://". ("FILE" can reference a file that is local or accessible on a network share; it is not applicable for hosted xSkrape.) Data reference expressions are written in JSON format and must comply with the following object model, expressed using C#:


        public class JsonEvent
        {
            // For: Common elements

            // The canonical URL, or a URL that contains substitution parameters (e.g. {0})
            public string url;

            // A valid processing method (default: none), one of none, step_sequential, enumerate_files, parse_json, parse_xml, step_sequential_parse_json, step_sequential_parse_xml
            public string method;

            // In an iterated situation, the maximum number of iterations allowed (default: 10000)
            public int? max_pass;

            // In an iterated situation, if the maximum number of iterations is reached, indicates whether the condition is an error or not (default: false / error)
            public bool? max_pass_ok;

            // Zero, one or many custom headers to include in the request
            public string[] headers;

            // Zero, one or many post variables to include in the request (of the form ["key", "value"])
            public string[][] post;

            // The request action to use (default: GET)
            public string action;

            // An optional username to include in the request
            public string username;

            // An optional password to include in the request
            public string password;

            // For: parse methods

            // xpath expression to identify row root node(s)
            public string rows_path;

            // none, all, innertext - if not none, shreds text into individual xml nodes based on line breaks
            public string split_lines;

            // If provided, applies a text filter prior to attempting to parse for tabular data
            public string prefilter;

            // One or more column definitions, within each row found
            public JsonEventJsonColumn[] columns;

            // Optionally identify a node-set relative to the row node that provides column names as inner text of each node in the set
            public string columns_path;

            // Optionally identify a node-set relative to the document that provides column names
            public string column_names_path;

            // Optionally identify zero, one or many join_set(s)
            public JoinSet[] join_sets;
            public JoinSet join_set;

            // For: step_sequential methods

            // Defines in iterative scenarios that involve checking for stop conditions, what stop condition(s) are checked. Valid values: any, no_data, dup_data. (Default: any)
            public string stop_on;

            // Zero, one or many arguments that replace substitution tags in the URL during iteration
            public JsonEventStepSeqArgument[] arguments;

            // For: enumerate_files (not valid in hosted functions)

            // Valid values: one_drive, local (default: local)
            public string folder_type;

            // Path specification for one or more source files
            public string file_path;

            // Optional filename regular expression pattern
            public string filename_regex_pattern;

            // Zero, one or many column specifications that are appended to the result set, pulling values from each filename involved
            public JsonEventFilenameColumn[] filename_columns;
        }

        public class JsonEventStepSeqArgument
        {
            // The start value for the parameter (optional, default: 1)
            public int? start;

            // The step value for the parameter - the amount added to the start and then each subsequent parameter value (optional, default: 1)
            public int? step;

            // An optional comma or pipe delimited list of values to iterate and replace in the URL, versus using a number
            public string parm_list;

            // When present, the current value from the parm_list is appended to the result set in a column with this name
            public string parm_column_name;
        }

        public class JsonEventFilenameColumn
        {
            // A regular expression that is applied to the source filename - the matched portion is added to the result set in the named column
            public string filename_regex;

            // The name of the column to append to the result set for the matched portion of the filename
            public string column_name;
        }

        public class JsonEventJsonColumn
        {
            // Relative xpath spec, identifying the column value node
            public string path;

            // The column name to use in the result set
            public string name;

            // The data type to use in the result set; can include: String, DateTime, Int16, Int32, Int64, etc. (default: String)
            public string datatype;

            public string format;

            // Can be blank or "url"; if "url", performs url encoding on the value for the column when added to the result set
            public string encode;

            // An additional parsing pass is possible over the candidate data to return for this column
            public string parse;

            // Describe the absolute text position for the start of the column data
            public int? position;

            // Describe the absolute text length for the column data
            public int? length;
        }

        public class JoinSet
        {
            // Relative xpath spec, identifying the root of the join set
            public string path;

            // One or more column specifications, relative to the join set path, that belong to this join set
            public JsonEventJsonColumn[] columns;
        }
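
As a rough illustration of the enumerated values named in the comments above, the following Python sketch checks an expression before it is submitted. It is not part of xSkrape: the helper name check_expression is hypothetical, and it assumes the expression is written as strict JSON with quoted keys.

```python
import json

# Field name -> (default, allowed values), taken from the object model comments.
ALLOWED = {
    "method": ("none", {
        "none", "step_sequential", "enumerate_files", "parse_json", "parse_xml",
        "step_sequential_parse_json", "step_sequential_parse_xml",
    }),
    "stop_on": ("any", {"any", "no_data", "dup_data"}),
    "folder_type": ("local", {"one_drive", "local"}),
}


def check_expression(text):
    """Return a list of problems found in a strict-JSON expression.

    Hypothetical helper: only the enumerated string fields are checked.
    """
    expr = json.loads(text)
    problems = []
    for field, (default, allowed) in ALLOWED.items():
        value = expr.get(field, default)
        if value not in allowed:
            problems.append("%s: unexpected value %r" % (field, value))
    return problems


good = '{"url": "http://somehost/somepagename?page={0}", "method": "step_sequential"}'
bad = '{"url": "http://somehost/", "method": "scrape", "stop_on": "never"}'
print(check_expression(good))
print(check_expression(bad))
```

Omitted fields fall back to the defaults listed in the object model, so an expression that only sets url passes the check.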

The following examples illustrate usage of the object model defined above.

Multiple requests, sequential loop

{ url:"urlspec"
, method:"step_sequential"
, headers:["CustomHeader: HeaderValue", ...]
, max_pass:"looplimit"
, max_pass_ok:"true|false"
, stop_on:"any|no_data|dup_data"
, arguments:[{start:"argstartvalue"
, step:"argincrementvalue"
, parm_list:"parm_list_text"
, parm_column_name:"parm_column_name"}, ...] }

This allows you to issue multiple parameterized requests with the resulting data being merged into a single result set based on your table matching criteria. "arguments" can map to one or more substitution parameters in the URL, matched in order (i.e. first argument matches {0}, second matches {1}, etc.). The typical use case for this is a paged grid scenario where a query parameter such as "page" (for page number - i.e. "http://somehost/somepagename?page={0}") identifies the current page of data being rendered.
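
The mapping from arguments to substitution parameters can be sketched in Python as follows. This only illustrates the described behavior and is not the xSkrape implementation: expand_urls is a hypothetical helper, it handles numeric arguments only, and it assumes the URL contains no braces other than the {0}, {1}, ... tags.

```python
def expand_urls(url_template, arguments, passes):
    """Yield the request URL for each pass.

    Each argument supplies a start and a step (both default to 1); the
    first argument fills {0}, the second fills {1}, and so on.
    """
    for n in range(passes):
        values = [arg.get("start", 1) + n * arg.get("step", 1) for arg in arguments]
        yield url_template.format(*values)


urls = list(expand_urls("http://somehost/somepagename?page={0}",
                        [{"start": 1, "step": 1}], 3))
for url in urls:
    print(url)
# http://somehost/somepagename?page=1
# http://somehost/somepagename?page=2
# http://somehost/somepagename?page=3
```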

urlspec - a URL used for requests that can include substitution values {0}, {1}, {2}, etc. for one or more parameters that are replaced for each request

CustomHeader: HeaderValue - zero, one or more request headers can be specified in standard header format

step_sequential - identifies the rule that handles multiple requests: in this case, a starting value is used with sequential incrementing until a stop condition is met

looplimit - Optional; the maximum number of iterations allowed (default is 500); can act as a fail-safe

max_pass_ok - Optional; the default behavior assumes "false" implying that if the looplimit is reached, it's considered an error condition; "true" implies the data is simply truncated silently at the looplimit.

no_data - stops when no data is found matching the table matching criteria; dup_data - stops when data is found that matches any previously-retrieved data; any - stops when either no_data or dup_data would be satisfied

argstartvalue - a numeric value representing the value used by the first request for the corresponding substitution parameter

argincrementvalue - a numeric value representing the value added to the previous request's corresponding substitution parameter, for the next request

parm_list_text - an optional string containing either pipe (|) or comma separated values that are passed to the URL (in URL encoded format) as substitution values, over multiple requests

parm_column_name - when using parm_list_text, names a column in the resulting table that holds the value of the item passed to the request URL
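
The interaction of stop_on, max_pass and max_pass_ok can be pictured with the Python sketch below. It is a simplified model under stated assumptions, not the actual engine: fetch_rows is a hypothetical callable that returns the rows matched for one request, a pass counts as duplicate data as soon as any of its rows was seen before, and the pass that triggers a stop is discarded.

```python
def run_sequential(fetch_rows, max_pass, start=1, step=1,
                   stop_on="any", max_pass_ok=False):
    """Merge rows from successive requests until a stop condition is met.

    fetch_rows(value) is a hypothetical callable returning a list of row
    tuples for the request made with the given substitution value.
    """
    merged, seen = [], set()
    value = start
    for _ in range(max_pass):
        rows = fetch_rows(value)
        if stop_on in ("any", "no_data") and not rows:
            return merged
        if stop_on in ("any", "dup_data") and any(row in seen for row in rows):
            return merged
        merged.extend(rows)
        seen.update(rows)
        value += step
    # The loop limit was reached without hitting a stop condition.
    if not max_pass_ok:
        raise RuntimeError("max_pass reached before a stop condition was met")
    return merged


# Made-up paged data: two pages of results, then nothing.
pages = {1: [("a",), ("b",)], 2: [("c",)]}
result = run_sequential(lambda page: pages.get(page, []), max_pass=10)
print(result)
```

With max_pass_ok set to true, the same loop returns the truncated data instead of raising an error when the limit is reached.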

Parse JSON or XML as tabular data

Available in version 3.0+ (non-Web), 1.0 (Web)

{ url:"urlspec"
, method:"parse_json | parse_xml"
, headers:["CustomHeader: HeaderValue", ...]
, rows_path:"row_xpath"
, columns:[{path:"column_xpath", name:"column_name", datatype:"type_name"}, ...]
, columns_path:"columns_xpath"
, column_names_path:"column_names_xpath"
, join_set: { path:"setroot_xpath"
, columns:[{path:"column_xpath", name:"column_name", datatype:"type_name", encode:"col_encode"}, ...] }
, join_sets: [ {path:"setroot_xpath"
, columns:[{path:"column_xpath", name:"column_name", datatype:"type_name", encode:"col_encode"}, ...] }, ... ]
}

This allows you to take source data that can be interpreted as JSON or XML and shape it into a tabular format.

urlspec - a URL used for the request

parse_json OR parse_xml - identifies the rule that parses source data from JSON or XML into a tabular format

CustomHeader: HeaderValue - zero, one or more request headers can be specified in standard header format

row_xpath - an XPath expression that identifies the level at which tabular rows will be isolated (see examples)

column_xpath - an XPath expression that originates under each node found from the row_xpath expression to retrieve data for the given column

column_name - explicitly names the given column (optional; if omitted, the name is derived from column_xpath)

type_name - explicitly defines the data type associated with the given column; valid values are based on .NET data types, including "String", "Int32", "DateTime", etc. (Optional; if omitted, "String" is assumed.)

col_encode - can be "url" or omitted entirely; if "url", the values assigned in the column will be URL encoded per standard encoding rules

columns_xpath - an XPath expression that originates under each node found from the row_xpath expression, returning a node set that represents individual columns

column_names_xpath - an XPath expression that originates from the root of the document, returning a node set for which the text of each node should become the names for each column available

setroot_xpath - an XPath expression that originates under each node found from the row_xpath expression, returning a node set that serves as the row_xpath for a nested table that is cross joined to the outer row
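
The effect of a join set can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustration of the cross join described for setroot_xpath, not the xSkrape implementation; the function name and the sample XML are made up, and dropping outer rows that have no nested nodes is an assumption.

```python
import xml.etree.ElementTree as ET


def rows_with_join(root, rows_path, columns, join_path, join_columns):
    """Repeat each outer row once per nested node found under join_path."""
    table = []
    for row in root.findall(rows_path):
        outer = {path: row.findtext(path) for path in columns}
        for node in row.findall(join_path):
            inner = {path: node.findtext(path) for path in join_columns}
            table.append({**outer, **inner})
    return table


# Made-up source data: two orders, each with one or more lines.
xml_text = (
    "<root>"
    "<order><id>1</id><line><sku>A</sku></line><line><sku>B</sku></line></order>"
    "<order><id>2</id><line><sku>C</sku></line></order>"
    "</root>"
)
joined = rows_with_join(ET.fromstring(xml_text), "order", ["id"], "line", ["sku"])
for record in joined:
    print(record)
# {'id': '1', 'sku': 'A'}
# {'id': '1', 'sku': 'B'}
# {'id': '2', 'sku': 'C'}
```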

Example:

{ url: "https://api.github.com/search/repositories?q=Excel+language:javascript&sort=stars&order=desc", method:"parse_json", rows_path:"items", columns:[{ path: "id", name: "ID", datatype:"Int32" }, { path:"description" }, {path: "pushed_at", name:"PushedDate", datatype:"DateTime" }, { path:"owner/login" }] }

For the Excel Desktop add-in, the expression is passed as a quoted parameter, so every quote inside it must be doubled:


=WebGetTable("{ url: ""https://api.github.com/search/repositories?q=Excel+language:javascript&sort=stars&order=desc"", method:""parse_json"", rows_path:""items"", columns:[{ path: ""id"", name: ""ID"", datatype:""Int32"" }, { path:""description"" }, {path: ""pushed_at"", name:""PushedDate"", datatype:""DateTime"" }, { path:""owner/login"" }] }", ...
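
Doubling the quotes by hand is error-prone for long expressions, so it can help to generate the quoted parameter. The Python sketch below shows one way to do that; excel_quote is a hypothetical helper and only produces the string argument, not the rest of the WebGetTable call.

```python
def excel_quote(expression):
    """Double every quote and wrap the result in quotes for an Excel formula."""
    return '"' + expression.replace('"', '""') + '"'


expression = '{ url: "http://somehost/somepagename", method:"parse_json", rows_path:"items" }'
print(excel_quote(expression))
# "{ url: ""http://somehost/somepagename"", method:""parse_json"", rows_path:""items"" }"
```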

In order to understand how to create tabular data from hierarchical data such as XML or JSON, understand that internally we treat both as XML, which allows XPath to work for both. XPath is useful since we can write a query that identifies both the root of the nodes that will be iterated to become one-per-row, and we can identify fields that feed column values within each row. For example, consider the following source data (coming from the example URL shown above):

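[Image: the JSON returned by the example URL, with the "items", "id" and "login" elements marked in red, green and blue respectively]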

The red lines show the first 4 elements that identify rows: one row per "items" element. This corresponds to the rows_path setting. (The "root" element is implied if the XPath expression starts with text, so "items" is actually treated as "/root/items".)

The green lines identify the "id" column within each row. Relative to the "items" element, the XPath to refer to this is simply "id" (the path parameter). In this example, we've named the resulting column "ID" and specified its data type as "Int32".

The blue lines identify the "login" column within each row. Relative to the "items" element, the XPath to refer to this is "owner/login". Since no column name is specified, "login" is assumed and "String" is assumed as the data type.
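
The idea of treating JSON as XML and then selecting rows and columns by path can be sketched in Python as follows. The conversion rules are assumptions made for illustration (xSkrape's internal conversion is not documented here): the top-level value is assumed to be an object, each array item becomes a repeated element named after its key, and only the "String" and "Int32" data types are handled. The sample data is made up and merely mirrors the shape of the example response.

```python
import json
import xml.etree.ElementTree as ET

CONVERTERS = {"String": str, "Int32": int}


def json_to_xml(value, tag="root"):
    """Return a list of elements named `tag` that represent a JSON value."""
    if isinstance(value, list):
        elements = []
        for item in value:
            elements.extend(json_to_xml(item, tag))
        return elements
    element = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            element.extend(json_to_xml(child, key))
    elif value is not None:
        element.text = str(value)
    return [element]


def to_table(source, rows_path, columns):
    """Build one record per node matching rows_path, relative to the implied root."""
    root = json_to_xml(json.loads(source))[0]
    table = []
    for row in root.findall(rows_path):
        record = {}
        for col in columns:
            # With no explicit name, the last path step becomes the column name.
            name = col.get("name") or col["path"].split("/")[-1]
            text = row.findtext(col["path"])
            convert = CONVERTERS[col.get("datatype", "String")]
            record[name] = None if text is None else convert(text)
        table.append(record)
    return table


source = json.dumps({"total_count": 2, "items": [
    {"id": 11, "owner": {"login": "alice"}},
    {"id": 22, "owner": {"login": "bob"}},
]})
columns = [{"path": "id", "name": "ID", "datatype": "Int32"}, {"path": "owner/login"}]
table = to_table(source, "items", columns)
print(table)
# [{'ID': 11, 'login': 'alice'}, {'ID': 22, 'login': 'bob'}]
```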

Multiple requests, sequential loop - JSON/XML source

Available in version 3.0+ (non-Web), 1.0 (Web)

{ url:"urlspec"
, method:"step_sequential_parse_json | step_sequential_parse_xml"
, headers:["CustomHeader: HeaderValue", ...]
, rows_path:"row_xpath"
, columns:[{path:"column_xpath", name:"column_name", datatype:"type_name"}, ...]
, columns_path:"columns_xpath"
, column_names_path:"column_names_xpath"
, join_set: { path:"setroot_xpath"
, columns:[{path:"column_xpath", name:"column_name", datatype:"type_name", encode:"col_encode"}, ...] }
, join_sets: [ {path:"setroot_xpath"
, columns:[{path:"column_xpath", name:"column_name", datatype:"type_name", encode:"col_encode"}, ...] }, ...]
, max_pass:"looplimit"
, max_pass_ok:"true|false"
, stop_on:"any|no_data|dup_data"
, arguments:[{start:"argstartvalue"
, step:"argincrementvalue"
, parm_list:"parm_list_text"
, parm_column_name:"parm_column_name"}, ...] }

This is a variation of "Multiple requests, sequential loop" combined with "parse_json" / "parse_xml", described above. It allows you to merge multiple requests with the ability to parse the data from each request from JSON or XML into a tabular format.
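
A combined expression can be assembled programmatically rather than typed by hand. The Python sketch below builds one as a dictionary and serializes it; the URL, rows_path and column paths are placeholder values chosen for illustration, and it assumes that strict JSON with quoted keys, as produced by json.dumps, is accepted.

```python
import json

# Placeholder values: a paged JSON source with one "items" element per row.
expression = {
    "url": "http://somehost/somepagename?page={0}",
    "method": "step_sequential_parse_json",
    "rows_path": "items",
    "columns": [
        {"path": "id", "name": "ID", "datatype": "Int32"},
        {"path": "owner/login"},
    ],
    "max_pass": 50,
    "stop_on": "no_data",
    "arguments": [{"start": 1, "step": 1}],
}

text = json.dumps(expression)
print(text)
```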

urlspec - a URL used for requests that can include substitution values {0}, {1}, {2}, etc. for one or more parameters that are replaced for each request

step_sequential_parse_json OR step_sequential_parse_xml - identifies the rule that handles multiple requests AND parses source data from JSON or XML into a tabular format