CodeX: Get Value Query Specification

When we're interested in a single value from raw source data, we can use one or more query terms to (a) locate the data and (b) optionally transform it before it is returned. Each query term consists of a criteria string, followed by an equals sign, followed by text that is specific to the criterion being applied. If more than one term is required, the terms can be combined using "&&". (Those familiar with C# will recognize this as the "logical and" operator, which is effectively what combining the terms does.) When multiple terms are combined, they are evaluated sequentially. This allows you, for example, to strip away content using a criteria string like "removeregex" and then identify the first regular expression match over the remaining text using "regex". In this manner, the text is reduced at each step, and whatever remains after the last step is returned by the calling function.
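To make the chaining concrete, here is a minimal Python sketch of how "&&"-joined terms could be evaluated one after another. It is an illustration only, not xSkrape's implementation: the names run_query, HANDLERS, remove_regex and first_regex_match are hypothetical, only two criteria are modeled, and it assumes that removeregex removes every match and that a failed regex yields an empty string.

```python
import re

def remove_regex(text, pattern):
    # Assumption: every match is removed, not just the first one.
    return re.sub(pattern, "", text)

def first_regex_match(text, pattern):
    # Assumption: an empty string is produced when nothing matches.
    match = re.search(pattern, text)
    return match.group(0) if match else ""

# Hypothetical lookup table; only two criteria are modeled here.
HANDLERS = {"removeregex": remove_regex, "regex": first_regex_match}

def run_query(source, query):
    text = source
    # Naive split: a pattern that itself contains "&&" would break this sketch.
    for term in query.split("&&"):
        # Split on the first "=" only, so the argument may contain "=".
        name, _, argument = term.partition("=")
        text = HANDLERS[name.strip().lower()](text, argument)
    return text

result = run_query("Price (USD): 42.50 today",
                   r"removeregex=\(.*?\)&&regex=\d+\.\d+")
print(result)  # 42.50
```

The first term strips the parenthesized text, and the second term then runs only over what is left, which is the "reduce at each step" behavior described above.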

Criteria Overview

xpath=xpath_expression

The source data must either be XML to begin with or be something xSkrape can transform into valid XML. Using Page Explorer's "As XML" option for loading data, you can get an idea of what the "normalized" XML structure would be. The xpath_expression should be a valid XPath expression that returns a single node. An XPath expression against an HTML page can become invalid if the page structure changes, sometimes in trivial ways. (We recommend other criteria unless you're confident the page will remain stable.) The value returned by the function using "xpath" is the inner text of the matched XML node.

multilinexpath=xpath_expression

Similar to xpath, except the XPath expression can return a node set. The inner text of all matched nodes is concatenated with line-feeds and is either returned or passed to the next query expression if there is a subsequent one.
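The difference between the single-node and node-set behavior can be sketched in Python with the standard library's ElementTree, which supports only a limited subset of XPath. The function names below are hypothetical and this is not how xSkrape itself evaluates expressions; returning an empty string when nothing matches is also an assumption.

```python
import xml.etree.ElementTree as ET

def inner_text(node):
    # Concatenate the text of the node and all of its descendants.
    return "".join(node.itertext())

def xpath_value(xml_text, expression):
    # "xpath"-style: inner text of the first (single) matched node.
    node = ET.fromstring(xml_text).find(expression)
    return inner_text(node) if node is not None else ""

def multiline_xpath_value(xml_text, expression):
    # "multilinexpath"-style: inner text of every matched node, joined by line feeds.
    nodes = ET.fromstring(xml_text).findall(expression)
    return "\n".join(inner_text(node) for node in nodes)

doc = ("<table><tr><td>Open</td><td>10.5</td></tr>"
       "<tr><td>Close</td><td>11.2</td></tr></table>")

print(xpath_value(doc, "./tr[2]/td[2]"))     # 11.2
print(multiline_xpath_value(doc, ".//td"))   # Open, 10.5, Close, 11.2 on separate lines
```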

removeregex=regular_expression

The provided regular expression is applied against the source data and if there's a match, it's replaced with an empty string and the result is either returned or passed to the next query expression if there is a subsequent one.

regex=regular_expression

The provided regular expression is applied against the source data and if there's a match, the matched text is either returned or passed to the next query expression if there is a subsequent one.

removetext=text

Similar to removeregex, except the text is treated as a simple case-insensitive match, not a regular expression match.
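As a rough illustration of the difference from removeregex, a case-insensitive literal removal can be sketched in Python by escaping the text before matching. The helper name remove_text is hypothetical, and removing every occurrence (rather than only the first) is an assumption.

```python
import re

def remove_text(source, text):
    # re.escape makes characters such as "." literal instead of regex wildcards.
    return re.sub(re.escape(text), "", source, flags=re.IGNORECASE)

print(remove_text("APPROX. 15 units", "approx. "))  # 15 units
```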

numberfollowsnear=regular_expression

An attempt is made to match the input regular expression with the source data. Matches are prioritized to pick the one with the closest number following the matched text, and it's the actual number that's either returned or passed to the next query expression if there is a subsequent one.
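One plausible reading of "closest number following" is sketched below in Python: for every match of the label expression, find the next number after it and keep the candidate with the smallest gap. The name number_follows_near, the number pattern, and the gap measure (characters between the end of the match and the start of the number) are all assumptions, not xSkrape's documented algorithm.

```python
import re

# Assumed shape of a number: optional sign, digits with optional
# thousands separators, optional decimal part.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def number_follows_near(source, pattern):
    best = None  # (gap, number_text)
    for label in re.finditer(pattern, source):
        number = NUMBER.search(source, label.end())
        if number is None:
            continue
        gap = number.start() - label.end()
        if best is None or gap < best[0]:
            best = (gap, number.group(0))
    # Strip thousands separators before returning; empty string if nothing found.
    return best[1].replace(",", "") if best else ""

text = "Price history is on page 7. Price: 19.99"
print(number_follows_near(text, "Price"))  # 19.99
```

Here the first "Price" is followed by a number only after 20 characters, while the second is followed by one after 2, so the second candidate wins.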

numberfollowinginnertext=regular_expression

Similar to numberfollowsnear, except uses the rules of followinginnertext.

currencyfollowsnear=regular_expression

Similar to numberfollowsnear, except matched values must be preceded by a dollar sign ($).

numberwithsuffixfollowsnear=regular_expression

Similar to numberfollowsnear, except values with the following suffixes can be matched as well:

K - assumes "thousands" and as such, multiplies the value by a thousand before returning it

M - assumes "millions" and as such, multiplies the value by a million before returning it

B - assumes "billions" and as such, multiplies the value by a billion before returning it

% - assumes "percentage" and as such, divides the value by a hundred before returning it
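The suffix rules above amount to a small lookup table of multipliers. A minimal Python sketch follows, using Decimal to avoid floating-point rounding. The function name parse_number_with_suffix is hypothetical, and treating the suffixes as uppercase-only is an assumption based on how they are listed.

```python
import re
from decimal import Decimal

# Multipliers taken from the list above.
SUFFIX_FACTORS = {
    "K": Decimal(1000),
    "M": Decimal(10) ** 6,
    "B": Decimal(10) ** 9,
    "%": Decimal("0.01"),
}

def parse_number_with_suffix(token):
    match = re.fullmatch(r"\s*(-?\d[\d,]*(?:\.\d+)?)\s*([KMB%]?)\s*", token)
    if match is None:
        return None
    value = Decimal(match.group(1).replace(",", ""))
    # No suffix means the value is returned unchanged.
    return value * SUFFIX_FACTORS.get(match.group(2), Decimal(1))

print(parse_number_with_suffix("1.5M"))   # 1500000.0
print(parse_number_with_suffix("12.5%"))  # 0.125
```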

numberwithsuffixfollowinginnertext=regular_expression

Similar to numberwithsuffixfollowsnear, except uses the rules of followinginnertext.

followinginnertext=regular_expression

Text that matches the input regular expression serves as the "starting point" for a text search. The search looks forward in the data for text that appears to be enclosed within an XML element, for example:

    ...<td>sometext</td>

or:

    ...>sometext</td>

It is the enclosed content that's either returned or passed to the next query expression. (In the above examples, "sometext".)

This criterion has the advantage of allowing you to identify text in a manner that's somewhat more resilient to page changes, if you assume, for example, that data contained within HTML tables will continue to be presented that way and that the labels used are unlikely to change. This is in contrast to using "xpath", where even some kinds of minor stylistic changes can invalidate the XPath expression being used.
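The forward search described for followinginnertext can be approximated in Python with two regular expressions: one to find the starting point, and one to find the next non-empty run of text sitting between a ">" and a "</". This is a sketch under assumptions; the name following_inner_text and the exact "enclosed text" pattern are hypothetical and may differ from what xSkrape actually does.

```python
import re

# Text that looks enclosed in an element: ">" + non-blank text + "</".
ENCLOSED = re.compile(r">\s*([^<>\s][^<>]*?)\s*</")

def following_inner_text(source, pattern):
    start = re.search(pattern, source)
    if start is None:
        return ""
    # Look forward only, beginning right after the matched label.
    enclosed = ENCLOSED.search(source, start.end())
    return enclosed.group(1) if enclosed else ""

html = "<tr><td>Market Cap</td><td>1.2B</td></tr>"
print(following_inner_text(html, "Market Cap"))  # 1.2B
```

Because the search keys off the label text rather than the element path, it keeps working if, say, an extra wrapper element is added around the table.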

wordsfollowafter=regular_expression

Returns full words (or phrases) that follow the expression. This can behave in a similar way to followinginnertext, but it will skip the next complete element, if one is present, as opposed to returning its inner text.

firstelement=name

Returns the inner text of the first element with the provided name.
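A rough Python equivalent, using the standard library's ElementTree on already well-formed XML, is shown below. The helper name first_element is hypothetical, and returning an empty string when no such element exists is an assumption.

```python
import xml.etree.ElementTree as ET

def first_element(xml_text, name):
    root = ET.fromstring(xml_text)
    # iter() walks the tree in document order (including the root itself).
    node = next(root.iter(name), None)
    return "".join(node.itertext()) if node is not None else ""

page = "<html><body><h1>Title <b>One</b></h1><h1>Two</h1></body></html>"
print(first_element(page, "h1"))  # Title One
```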

alltextafter=text

Typically used as one step in a chained set of expressions, strips away all text before (and including) the input string.
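In Python terms this is a simple slice after the first occurrence of the input string. The helper name all_text_after is hypothetical; case-sensitive matching and returning the text unchanged when the string is absent are assumptions.

```python
def all_text_after(source, text):
    index = source.find(text)
    if index == -1:
        return source  # Assumption: leave the text untouched when not found.
    # Drop everything up to and including the matched string.
    return source[index + len(text):]

print(all_text_after("Name: Widget; Price: 9.99", "Price: "))  # 9.99
```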

validate=int|decimal

This criterion assumes you've isolated a piece of text that you expect to be either an integer or a decimal value. If the text is not of the type you've specified, a blank is returned instead.
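The check can be sketched in Python as a pattern test that passes the text through or blanks it. The name validate_value and the exact number formats accepted (optional sign, no thousands separators, integers also accepted as decimals) are assumptions.

```python
import re

# Assumed formats; the real rules may be more or less permissive.
PATTERNS = {
    "int": re.compile(r"[+-]?\d+"),
    "decimal": re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)"),
}

def validate_value(text, kind):
    candidate = text.strip()
    # Return the text when it has the requested type, otherwise a blank.
    return candidate if PATTERNS[kind].fullmatch(candidate) else ""

print(validate_value("42", "int"))        # 42
print(validate_value("42.5", "int"))      # (blank)
print(validate_value("42.5", "decimal"))  # 42.5
```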

parsedate=auto

"auto" is the only mode currently supported. It works by first trying to parse the current text as a standard date (and/or time). If that is unsuccessful, additional date parsing rules are used that allow a date like this to be recognized and returned in a standardized form: "Tue, Jul 21, 2015, 4:33pm EDT".
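The two-stage approach (standard parse first, fallback rules second) can be sketched in Python as follows. Everything here is an assumption made for illustration: the name parse_date_auto, the use of ISO 8601 as the "standard" form, the fallback formats, and the decision to discard the trailing time zone abbreviation. It also relies on English month and day names (the default C locale).

```python
import re
from datetime import datetime

# Hypothetical fallback formats; the real rule set is not documented here.
FALLBACK_FORMATS = ("%a, %b %d, %Y, %I:%M%p", "%b %d, %Y")

def parse_date_auto(text):
    candidate = text.strip()
    # Stage 1: try a standard (ISO 8601) date or date-time.
    try:
        return datetime.fromisoformat(candidate).isoformat()
    except ValueError:
        pass
    # Stage 2: drop a trailing zone abbreviation such as "EDT" (assumption:
    # the zone is ignored), then try the looser formats.
    candidate = re.sub(r"\s+[A-Z]{2,5}$", "", candidate)
    for fmt in FALLBACK_FORMATS:
        try:
            return datetime.strptime(candidate, fmt).isoformat()
        except ValueError:
            continue
    return ""

print(parse_date_auto("Tue, Jul 21, 2015, 4:33pm EDT"))  # 2015-07-21T16:33:00
```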

maxlength=value

A parsed value can only be returned if its textual length does not exceed this setting.

raw=all|nospaces

The text processed so far is returned either as-is (for all) or with spaces removed (for nospaces). Space removal does not apply inside quoted strings, where spaces are preserved.
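The quoted-string exception for nospaces can be sketched in Python by splitting the text into quoted and unquoted pieces and removing spaces only from the unquoted ones. The name raw_value is hypothetical, and the sketch assumes double quotes only and plain space characters only (not tabs or line breaks).

```python
import re

def raw_value(text, mode):
    if mode == "all":
        return text
    if mode == "nospaces":
        # The capturing group keeps the quoted pieces in the result,
        # at the odd indexes of the split list.
        pieces = re.split(r'("[^"]*")', text)
        return "".join(
            piece if index % 2 == 1 else piece.replace(" ", "")
            for index, piece in enumerate(pieces)
        )
    raise ValueError("mode must be 'all' or 'nospaces'")

print(raw_value('a b "c d" e', "nospaces"))  # ab"c d"e
```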
