URL Expression Specification

There are two main formats for source data references:

  • A valid URL.
  • A data reference expression.

Regular URL references include those starting with "HTTP://", "HTTPS://" or "FILE://". ("File" can reference a file that's local or accessible on a network share - not applicable for hosted xSkrape.) Data reference expressions are actually in JSON format, conforming to certain rules. The actual JSON must comply with the following object model, expressed using C#:

        public class JsonEvent
        {
            // For: Common elements

            // The canonical URL or a URL that contains a substitution parameters (eg. {0})
            public string url;

            // A valid processing method (default: none), one of none, step_sequential, enumerate_files, parse_json, parse_xml, step_sequential_parse_json, step_sequential_parse_xml
            public string method;

            // In an iterated situation, the maximum number of iterations allowed (default: 10000)
            public int? max_pass;

            // In an iterated situation, if the maximum number of iterations is reached, indicates whether the condition is an error or not (default: false / error)
            public bool? max_pass_ok;

            // Zero, one or many custom headers to include in the request
            public string[] headers;

            // Zero, one or many post variables to include in the request (of the form ["key", "value"])
            public string[][] post;

            // Default is GET, but can change via this
            public string action;

            // An optional username to include in the request
            public string username;

            // An optional password to include in the request
            public string password;

            // For: parse methods

            // xpath expression to identify row root node(s)
            public string rows_path;

            // none, all, innertext - if not none, shreds text into individual xml nodes based on line breaks
            public string split_lines;

            // If provided, applies a text filter prior to attempting to parse for tabular data
            public string prefilter;

            // One or more column definitions, within each row found
            public JsonEventJsonColumn[] columns;

            // Optionally identify a node-set relative to the row node that provides column names as inner text of each node in the set
            public string columns_path;

            // Optionally identify a node-set relative to the document that provides column names
            public string column_names_path;

            // Optionally identify zero, one or many join_set(s)
            public JoinSet[] join_sets;
            public JoinSet join_set;

            // For: step_sequential methods

            // Defines in iterative scenarios that involve checking for stop conditions, what stop condition(s) are checked. Valid values: any, no_data, dup_data. (Default: any)
            public string stop_on;

            // Zero, one or many arguments that replace substitution tags in the URL during iteration
            public JsonEventStepSeqArgument[] arguments;

            // For: enumerate_files (not valid in hosted functions)

            // Valid values: one_drive, local (default: local)
            public string folder_type;

            // Path specification for one or more source files
            public string file_path;

            // Optional filename regular expression pattern
            public string filename_regex_pattern;

            // Zero, one or many column specifications that are appended to the result set, pulling values from each filename involved
            public JsonEventFilenameColumn[] filename_columns;
        }

        public class JsonEventStepSeqArgument
        {
            // The start value for the parameter (optional, default: 1)
            public int? start;

            // The step value for the parameter - the amount added to the start and then each subsequent parameter value (optional, default: 1)
            public int? step;

            // An optional comma or pipe delimited list of values to iterate and replace in the URL, versus using a number
            public string parm_list;

            // When present, the current value from the parm_list is appended to the result set in a column with this name
            public string parm_column_name;
        }

        public class JsonEventFilenameColumn
        {
            // A regular expression that is applied to the source filename - the matched portion is added to the result set in the named column
            public string filename_regex;

            // The name of the column to append to the result set for the matched portion of the filename
            public string column_name;
        }

        public class JsonEventJsonColumn
        {
            // Relative xpath spec, identifying the column value node
            public string path;

            // The column name to use in the result set
            public string name;

            // The data type to use in the result set; can include: String, DateTime, Int16, Int32, Int64, etc. (default: String)
            public string datatype;

            public string format;

            // Can be blank or "url"; if "url", performs url encoding on the value for the column when added to the result set
            public string encode;

            // An additional parsing pass is possible over the candidate data to return for this column
            public string parse;

            // Describe the absolute text position for the start of the column data
            public int? position;

            // Describe the absolute text length for the column data
            public int? length;
        }

        public class JoinSet
        {
            // Relative xpath spec, identifying the root of the join set
            public string path;

            // One or more column specifications, relative to the join set path, that belong to this join set
            public JsonEventJsonColumn[] columns;
        }

The following are to be taken as examples of usage of the formal object model, defined above.

Multiple requests, sequential loop { url:"urlspec"
    , method:"step_sequential"
    , headers:["CustomHeader: HeaderValue", ...]
    , max_pass:"looplimit"
    , max_pass_ok:"true|false"
    , stop_on:"any|no_data|dup_data"
    , arguments:[{start:"argstartvalue"
        , step:"argincrementvalue"}
        , parm_list:"parm_list_text"
        , parm_column_name:"parm_column_name"}, ...] }

This allows you to issue multiple parameterized requests with the resulting data being merged into a single result set based on your table matching criteria. "arguments" can map to one or more substitution parameters in the URL, matched in order (i.e. first argument matches {0}, second matches {1}, etc.). The typical use case for this is a paged grid scenario where a query parameter such as "page" (for page number - i.e. "http://somehost/somepagename?page={0}") identifies the current page of data being rendered.

urlspec - a URL used for requests that can include substitution values {0}, {1}, {2}, etc. for one or more parameters that are replaced for each request

CustomHeader: HeaderValue - zero, one or more request headers can be specified in standard header format

step_sequential - identifies the rule that handles multiple requests: in this case, a starting value is used with sequential incrementing until a stop condition is met

looplimit - Optional; the maximum number of iterations allowed (default is 500); can act as a fail-safe

max_pass_ok - Optional; the default behavior assumes "false" implying that if the looplimit is reached, it's considered an error condition; "true" implies the data is simply truncated silently at the looplimit.

no_data - stops when no data is found matching the table matching criteria; dup_data - stops when data is found that matches any previously-retrieved data; any - stops when either no_data or dup_data would be satisfied

argstartvalue - a numeric value representing the value used by the first request for the corresponding subsitution parameter

argincrementvalue - a numeric value representing the value added to the previous request's corresponding subsitution parameter, for the next request

parm_list_text - an optional string containing either pipe (|) or comma separated values that are passed to the URL (in URL encoded format) as substitution values, over multiple requests

parm_column_name - when using parm_list_text, names a column in the resulting table that holds the value of the item passed to the request URL

See Example

Parse JSON or XML as tabular data

Available version 3.0+ (non-Web), 1.0 (Web)
{ url:"urlspec"
    , method:"parse_json | parse_xml"
    , headers:["CustomHeader: HeaderValue", ...]
    , rows_path:"row_xpath"
    , columns:[{path:"column_xpath", name:"column_name", datatype:"type_name"}, ...]
    , columns_path:"columns_xpath"
    , column_names_path:"column_names_xpath"
    , join_set: { path:"setroot_xpath"
        , columns:[{path:"column_xpath", name:"column_name", datatype:"type_name", encode:"col_encode"}, ...] }
    , join_sets: [ {path:"setroot_xpath"
        , columns:[{path:"column_xpath", name:"column_name", datatype:"type_name", encode:"col_encode"}, ...] }, ... ]
}

This allows you to take source data that can be interpreted as JSON or XML and shape it into a tabular format.

urlspec - a URL used for the request

parse_json OR parse_xml - identifies the rule that parses source data from JSON or XML into a tabular format

CustomHeader: HeaderValue - zero, one or more request headers can be specified in standard header format

row_xpath - an XPath expression that identifies the level at which tabular rows will be isolated (see examples)

column_xpath - an XPath expression that originates under each node found from the row_path expression to retrieve data for the given column

column_name - explicitly names the given column (optional - if omitted, bases on the column_path)

type_name - explicitly defines the data type associated with the given column; valid values are based on .NET data types including ("String", "Int32", "DateTime", etc.). (Optional - if omitted, assumes "String")

col_encode - can be "url" or omitted entirely; if "url", the values assigned in the column will be URL encoded per standard encoding rules

columns_xpath - an XPath expression that originates under each node found from the row_path expression, returning a node set that represents individual columns

column_names_xpath - an XPath expression that originates from the root of the document, returning a node set for which the text of each node should become the names for each column available

setroot_xpath - an XPath expression that originates under each node found from the row_path expression, returning a node set that serves as the row_xpath for a nested table that is cross joined to the outer row

See Example

Multiple requests, sequential loop - JSON/XML source

Available version 3.0+ (non-Web), 1.0 (Web)
{ url:"urlspec"
    , method:"step_sequential_parse_json | step_sequential_parse_xml"
    , headers:["CustomHeader: HeaderValue", ...]
    , rows_path:"row_xpath"
    , columns:[{path:"column_xpath", name:"column_name", datatype:"type_name"}, ...]
    , columns_path:"columns_xpath"
    , column_names_path:"column_names_xpath"
    , join_set: { path:"setroot_xpath"
        , columns:[{path:"column_xpath", name:"column_name", datatype:"type_name", encode:"col_encode"}, ...] }
    , join_sets: [ {path:"setroot_xpath"
        , columns:[{path:"column_xpath", name:"column_name", datatype:"type_name", encode:"col_encode"}, ...] }, ...]
    , max_pass:"looplimit"
    , max_pass_ok:"true|false"
    , stop_on:"any|no_data|dup_data"
    , arguments:[{start:"argstartvalue"
        , step:"argincrementvalue"}
        , parm_list:"parm_list_text"
        , parm_column_name:"parm_column_name"}, ...] }

This is a variation of "Multiple requests, sequential loop" combined with "parse_json" / "parse_xml", described above. It allows you to merge multiple requests with the ability to parse the data from each request from JSON or XML into a tabular format.

urlspec - a URL used for requests that can include substitution values {0}, {1}, {2}, etc. for one or more parameters that are replaced for each request

step_sequential_parse_json OR step_sequential_parse_xml - identifies the rule that handles multiple requests AND parses source data from JSON or XML into a tabular format

CustomHeader: HeaderValue - zero, one or more request headers can be specified in standard header format

row_xpath - an XPath expression that identifies the level at which tabular rows will be isolated (see examples)

column_xpath - an XPath expression that originates under each node found from the row_path expression to retrieve data for the given column

column_name - explicitly names the given column (optional - if omitted, bases on the column_path)

type_name - explicitly defines the data type associated with the given column; valid values are based on .NET data types including ("String", "Int32", "DateTime", etc.). (Optional - if omitted, assumes "String")

col_encode - can be "url" or omitted entirely; if "url", the values assigned in the column will be URL encoded per standard encoding rules

columns_xpath - an XPath expression that originates under each node found from the row_path expression, returning a node set that represents individual columns

column_names_xpath - an XPath expression that originates from the root of the document, returning a node set for which the text of each node should become the names for each column available

setroot_xpath - an XPath expression that originates under each node found from the row_path expression, returning a node set that serves as the row_xpath for a nested table that is cross joined to the outer row

looplimit - Optional; the maximum number of iterations allowed (default is 500); can act as a fail-safe

max_pass_ok - Optional; the default behavior assumes "false" implying that if the looplimit is reached, it's considered an error condition; "true" implies the data is simply truncated silently at the looplimit.

no_data - stops when no data is found matching the table matching criteria; dup_data - stops when data is found that matches any previously-retrieved data; any - stops when either no_data or dup_data would be satisfied

argstartvalue - a numeric value representing the value used by the first request for the corresponding subsitution parameter

argincrementvalue - a numeric value representing the value added to the previous request's corresponding subsitution parameter, for the next request

parm_list_text - an optional string containing either pipe (|) or comma separated values that are passed to the URL (in URL encoded format) as substitution values, over multiple requests

parm_column_name - when using parm_list_text, names a column in the resulting table that holds the value of the item passed to the request URL

See Example

Pass Custom Headers Only

Available version 3.0+ (non-Web), 1.0 (Web)
{ url:"urlspec"
    , method:"none"
    , headers:["CustomHeader: HeaderValue", ...]
}

This allows you to issue a request using custom header values.

urlspec - the URL to use for the request

CustomHeader: HeaderValue - one or more request headers can be specified in standard header format

Do you have a suggestion for a use case we don't support currently? Let us know!