The Query Language

1. Introduction

This document is intended as a reference guide to the full syntax and semantics of AsterixDB’s query language, a SQL-based language for working with semistructured data. The language is a derivative of SQL++, a declarative query language for JSON data which is largely backwards compatible with SQL. SQL++ originated from research in the FORWARD project at UC San Diego, and it has much in common with SQL; some differences exist due to the different data models that the two languages were designed to serve. SQL was designed in the 1970’s for interacting with the flat, schema-ified world of relational databases, while SQL++ is much newer and targets the nested, schema-optional (or even schema-less) world of modern NoSQL systems.

In the context of Apache AsterixDB, the query language is intended for working with the Asterix Data Model (ADM), a data model based on a superset of JSON with an enriched and flexible type system. New AsterixDB users are encouraged to read and work through the (much friendlier) guide “AsterixDB 101: An ADM and SQL++ Primer” before attempting to make use of this document. In addition, readers are advised to read through the Asterix Data Model (ADM) reference guide first, as an understanding of the data model is a prerequisite to understanding the query language.

In what follows, we detail the features of the query language in a grammar-guided manner. We list and briefly explain each of the productions in the query grammar, offering examples (and results) for clarity.

2. Expressions

The query language is a highly composable expression language. Each expression in the query language returns zero or more data model instances. At the topmost level, an expression can be an OperatorExpression (similar to a mathematical expression) or a QuantifiedExpression (which yields a boolean value). Each will be detailed as we explore the full grammar of the language.

Expression ::= OperatorExpression | QuantifiedExpression

Note that in the following text, words enclosed in angle brackets denote keywords that are not case-sensitive.

Operator Expressions

Operators perform a specific operation on the input values or expressions. The syntax of an operator expression is as follows:

OperatorExpression ::= PathExpression
                       | Operator OperatorExpression
                       | OperatorExpression Operator (OperatorExpression)?
                       | OperatorExpression <BETWEEN> OperatorExpression <AND> OperatorExpression

The language provides a full set of operators that you can use within its statements. Here are the categories of operators:

  • Arithmetic operators, to perform basic mathematical operations;
  • Collection operators, to test membership and emptiness of collections;
  • Comparison operators, to compare two expressions;
  • Logical operators, to combine or negate boolean expressions.

The following table summarizes the precedence order (from higher to lower) of the major unary and binary operators:

Operator Operation
EXISTS, NOT EXISTS Collection emptiness testing
^ Exponentiation
*, /, DIV, MOD (%) Multiplication, division, modulo
+, - Addition, subtraction
|| String concatenation
IS NULL, IS NOT NULL, IS MISSING, IS NOT MISSING, IS UNKNOWN, IS NOT UNKNOWN, IS VALUED, IS NOT VALUED Unknown value comparison
BETWEEN, NOT BETWEEN Range comparison (inclusive on both sides)
=, !=, <>, <, >, <=, >=, LIKE, NOT LIKE, IN, NOT IN Comparison
NOT Logical negation
AND Conjunction
OR Disjunction
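
For example, according to this precedence order, the first of the following queries evaluates the multiplication before the addition and yields 14, while the second, parenthesized query yields 20 (a small illustration).

Examples
SELECT VALUE 2 + 3 * 4;

SELECT VALUE (2 + 3) * 4;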

In general, if any operand evaluates to a MISSING value, the enclosing operator will return MISSING; if no operand evaluates to a MISSING value but at least one operand evaluates to a NULL value, the enclosing operator will return NULL. However, there are a few exceptions, listed under the comparison operators and logical operators below.
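
For instance, in the first of the following expressions the addition operator returns NULL because one operand is NULL, while in the second it returns MISSING because one operand is MISSING (a small illustration of the rule above).

Examples
1 + NULL
1 + MISSING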

Arithmetic Operators

Arithmetic operators are used to exponentiate, add, subtract, multiply, and divide numeric values, or concatenate string values.

Operator Purpose Example
+, - As unary operators, they denote a positive or negative expression SELECT VALUE -1;
+, - As binary operators, they add or subtract SELECT VALUE 1 + 2;
* Multiply SELECT VALUE 4 * 2;
/ Divide (returns a value of type double if both operands are integers) SELECT VALUE 5 / 2;
DIV Divide (returns an integer value if both operands are integers) SELECT VALUE 5 DIV 2;
MOD (%) Modulo SELECT VALUE 5 % 2;
^ Exponentiation SELECT VALUE 2^3;
|| String concatenation SELECT VALUE "ab"||"c"||"d";

Collection Operators

Collection operators are used for membership tests (IN, NOT IN) or empty collection tests (EXISTS, NOT EXISTS).

Operator Purpose Example
IN Membership test SELECT * FROM ChirpMessages cm WHERE cm.user.lang IN ["en", "de"];
NOT IN Non-membership test SELECT * FROM ChirpMessages cm WHERE cm.user.lang NOT IN ["en"];
EXISTS Check whether a collection is not empty SELECT * FROM ChirpMessages cm WHERE EXISTS cm.referredTopics;
NOT EXISTS Check whether a collection is empty SELECT * FROM ChirpMessages cm WHERE NOT EXISTS cm.referredTopics;

Comparison Operators

Comparison operators are used to compare values. The comparison operators fall into one of two sub-categories: missing value comparisons and regular value comparisons. The query language (and JSON) has two ways of representing missing information in an object: the presence of the field with a NULL for its value (as in SQL), and the absence of the field (which JSON permits). For example, the first of the following objects represents Jack, whose friend is Jill. In the other examples, Jake is friendless à la SQL, with a friend field that is NULL, while Joe is friendless in a more natural (for JSON) way, i.e., by not having a friend field.

Examples

{"name": "Jack", "friend": "Jill"}

{"name": "Jake", "friend": NULL}

{"name": "Joe"}

The following table enumerates all of the query language’s comparison operators.

Operator Purpose Example
IS NULL Test if a value is NULL SELECT * FROM ChirpMessages cm WHERE cm.user.name IS NULL;
IS NOT NULL Test if a value is not NULL SELECT * FROM ChirpMessages cm WHERE cm.user.name IS NOT NULL;
IS MISSING Test if a value is MISSING SELECT * FROM ChirpMessages cm WHERE cm.user.name IS MISSING;
IS NOT MISSING Test if a value is not MISSING SELECT * FROM ChirpMessages cm WHERE cm.user.name IS NOT MISSING;
IS UNKNOWN Test if a value is NULL or MISSING SELECT * FROM ChirpMessages cm WHERE cm.user.name IS UNKNOWN;
IS NOT UNKNOWN Test if a value is neither NULL nor MISSING SELECT * FROM ChirpMessages cm WHERE cm.user.name IS NOT UNKNOWN;
IS KNOWN (IS VALUED) Test if a value is neither NULL nor MISSING SELECT * FROM ChirpMessages cm WHERE cm.user.name IS KNOWN;
IS NOT KNOWN (IS NOT VALUED) Test if a value is NULL or MISSING SELECT * FROM ChirpMessages cm WHERE cm.user.name IS NOT KNOWN;
BETWEEN Test if a value is between a start value and an end value; the comparison is inclusive of both start and end values SELECT * FROM ChirpMessages cm WHERE cm.chirpId BETWEEN 10 AND 20;
= Equality test SELECT * FROM ChirpMessages cm WHERE cm.chirpId=10;
!= Inequality test SELECT * FROM ChirpMessages cm WHERE cm.chirpId!=10;
<> Inequality test SELECT * FROM ChirpMessages cm WHERE cm.chirpId<>10;
< Less than SELECT * FROM ChirpMessages cm WHERE cm.chirpId<10;
> Greater than SELECT * FROM ChirpMessages cm WHERE cm.chirpId>10;
<= Less than or equal to SELECT * FROM ChirpMessages cm WHERE cm.chirpId<=10;
>= Greater than or equal to SELECT * FROM ChirpMessages cm WHERE cm.chirpId>=10;
LIKE Test if the left side matches a pattern defined on the right side; in the pattern, "%" matches any string while "_" matches any character SELECT * FROM ChirpMessages cm WHERE cm.user.name LIKE "%Giesen%";
NOT LIKE Test if the left side does not match a pattern defined on the right side; in the pattern, "%" matches any string while "_" matches any character SELECT * FROM ChirpMessages cm WHERE cm.user.name NOT LIKE "%Giesen%";

The following table summarizes how the missing value comparison operators work.

Operator Non-NULL/Non-MISSING value NULL MISSING
IS NULL FALSE TRUE MISSING
IS NOT NULL TRUE FALSE MISSING
IS MISSING FALSE FALSE TRUE
IS NOT MISSING TRUE TRUE FALSE
IS UNKNOWN FALSE TRUE TRUE
IS NOT UNKNOWN TRUE FALSE FALSE
IS KNOWN (IS VALUED) TRUE FALSE FALSE
IS NOT KNOWN (IS NOT VALUED) FALSE TRUE TRUE

Logical Operators

Logical operators perform logical NOT, AND, and OR operations over Boolean values (TRUE and FALSE) plus NULL and MISSING.

Operator Purpose Example
NOT Returns true if the following condition is false, otherwise returns false SELECT VALUE NOT TRUE;
AND Returns true if both branches are true, otherwise returns false SELECT VALUE TRUE AND FALSE;
OR Returns true if one branch is true, otherwise returns false SELECT VALUE FALSE OR FALSE;

The following table is the truth table for AND and OR.

A B A AND B A OR B
TRUE TRUE TRUE TRUE
TRUE FALSE FALSE TRUE
TRUE NULL NULL TRUE
TRUE MISSING MISSING TRUE
FALSE FALSE FALSE FALSE
FALSE NULL FALSE NULL
FALSE MISSING FALSE MISSING
NULL NULL NULL NULL
NULL MISSING MISSING NULL
MISSING MISSING MISSING MISSING

The following table demonstrates the results of NOT on all possible inputs.

A NOT A
TRUE FALSE
FALSE TRUE
NULL NULL
MISSING MISSING

Quantified Expressions

QuantifiedExpression ::= ( (<ANY>|<SOME>) | <EVERY> ) Variable <IN> Expression ( "," Variable <IN> Expression )*
                         <SATISFIES> Expression (<END>)?

Quantified expressions are used for expressing existential or universal predicates involving the elements of a collection.

The following pair of examples illustrates the use of a quantified expression to test that every (or some) element in the set [1, 2, 3] of integers is less than three. The first example yields FALSE and the second example yields TRUE.

It is useful to note that if the set were instead the empty set, the first expression would yield TRUE (“every” value in an empty set satisfies the condition) while the second expression would yield FALSE (since there isn’t “some” value, as there are no values in the set, that satisfies the condition).

A quantified expression will return a NULL (or MISSING) if the first expression in it evaluates to NULL (or MISSING). A type error will be raised if the first expression in a quantified expression does not return a collection.

Examples
EVERY x IN [ 1, 2, 3 ] SATISFIES x < 3
SOME x IN [ 1, 2, 3 ] SATISFIES x < 3
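
The empty-collection behavior described above can be illustrated by replacing the set with an empty array: the first of the following expressions yields TRUE, while the second yields FALSE.

Examples
EVERY x IN [ ] SATISFIES x < 3
SOME x IN [ ] SATISFIES x < 3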

Path Expressions

PathExpression  ::= PrimaryExpression ( Field | Index )*
Field           ::= "." Identifier
Index           ::= "[" Expression "]"

Components of complex types in the data model are accessed via path expressions. Path access can be applied to the result of a query expression that yields an instance of a complex type, for example, an object or array instance. For objects, path access is based on field names. For arrays, path access is based on (zero-based) array-style indexing. Attempts to access non-existent fields or out-of-bounds array elements produce the special value MISSING. For multisets, path access is also zero-based and returns an arbitrary multiset element if the index is within the size of the multiset, or MISSING otherwise. Type errors will be raised for inappropriate use of a path expression, such as applying a field accessor to a numeric value.

The following examples illustrate field access for an object, index-based element access for an array, and also a composition thereof.

Examples
({"name": "MyABCs", "array": [ "a", "b", "c"]}).array

(["a", "b", "c"])[2]

({"name": "MyABCs", "array": [ "a", "b", "c"]}).array[2]

Primary Expressions

PrimaryExpr ::= Literal
              | VariableReference
              | ParameterReference
              | ParenthesizedExpression
              | FunctionCallExpression
              | CaseExpression
              | Constructor

The most basic building block for any expression in the query language is the PrimaryExpression. This can be a simple literal (constant) value, a reference to a query variable that is in scope, a reference to a statement parameter, a parenthesized expression, a function call, a case expression, or a newly constructed instance of the data model (such as a newly constructed object, array, or multiset of data model instances).

Literals

Literal        ::= StringLiteral
                   | IntegerLiteral
                   | FloatLiteral
                   | DoubleLiteral
                   | <NULL>
                   | <MISSING>
                   | <TRUE>
                   | <FALSE>
StringLiteral  ::= "\"" (
                             <EscapeQuot>
                           | <EscapeBslash>
                           | <EscapeSlash>
                           | <EscapeBspace>
                           | <EscapeFormf>
                           | <EscapeNl>
                           | <EscapeCr>
                           | <EscapeTab>
                           | ~["\"","\\"])*
                    "\""
                    | "\'"(
                             <EscapeApos>
                           | <EscapeBslash>
                           | <EscapeSlash>
                           | <EscapeBspace>
                           | <EscapeFormf>
                           | <EscapeNl>
                           | <EscapeCr>
                           | <EscapeTab>
                           | ~["\'","\\"])*
                      "\'"
<EscapeApos>   ::= "\\\'"
<EscapeQuot>   ::= "\\\""
<EscapeBslash> ::= "\\\\"
<EscapeSlash>  ::= "\\/"
<EscapeBspace> ::= "\\b"
<EscapeFormf>  ::= "\\f"
<EscapeNl>     ::= "\\n"
<EscapeCr>     ::= "\\r"
<EscapeTab>    ::= "\\t"

IntegerLiteral ::= <DIGITS>
<DIGITS>       ::= ["0" - "9"]+
FloatLiteral   ::= <DIGITS> ( "f" | "F" )
                 | <DIGITS> ( "." <DIGITS> ( "f" | "F" ) )?
                 | "." <DIGITS> ( "f" | "F" )
DoubleLiteral  ::= <DIGITS> "." <DIGITS>
                   | "." <DIGITS>

Literals (constants) in a query can be strings, integers, floating point values, double values, boolean constants, or special constant values like NULL and MISSING. The NULL value is like a NULL in SQL; it is used to represent an unknown field value. The special value MISSING is only meaningful in the context of field accesses; it occurs when the accessed field simply does not exist at all in the object being accessed.

The following are some simple examples of literals.

Examples
'a string'
"test string"
42

Unlike in standard SQL, double quotes play the same role as single quotes and may also be used for string literals in queries.

Variable References

VariableReference     ::= <IDENTIFIER> | <DelimitedIdentifier>
<IDENTIFIER>          ::= (<LETTER> | "_") (<LETTER> | <DIGIT> | "_" | "$")*
<LETTER>              ::= ["A" - "Z", "a" - "z"]
DelimitedIdentifier   ::= "`" (<EscapeQuot>
                                | <EscapeBslash>
                                | <EscapeSlash>
                                | <EscapeBspace>
                                | <EscapeFormf>
                                | <EscapeNl>
                                | <EscapeCr>
                                | <EscapeTab>
                                | ~["`","\\"])*
                          "`"

A variable in a query can be bound to any legal data model value. A variable reference refers to the value to which an in-scope variable is bound. (E.g., a variable binding may originate from one of the FROM, WITH or LET clauses of a SELECT statement or from an input parameter in the context of a function body.) Backticks, for example, `id`, are used for delimited identifiers. Delimiting is needed when a variable’s desired name clashes with a keyword or includes characters not allowed in regular identifiers. More information on exactly how variable references are resolved can be found in the appendix section on Variable Resolution.

Examples
tweet
id
`SELECT`
`my-function`

Parameter References

ParameterReference              ::= NamedParameterReference | PositionalParameterReference
NamedParameterReference         ::= "$" (<IDENTIFIER> | <DelimitedIdentifier>)
PositionalParameterReference    ::= ("$" <DIGITS>) | "?"

A statement parameter is an external variable whose value is provided through the statement execution API. An error will be raised if the parameter is not bound at query execution time. Positional parameter numbering starts at 1. "?" parameters are interpreted as $1, ..., $N in the order in which they appear in the statement.

Examples
$id
$1
?

Parenthesized Expressions

ParenthesizedExpression ::= "(" Expression ")" | Subquery

An expression can be parenthesized to control the precedence order or otherwise clarify a query. For composability, a subquery is also a parenthesized expression.

The following expression evaluates to the value 2.

Example
( 1 + 1 )
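
Since a subquery is also a parenthesized expression, a parenthesized SELECT statement may likewise appear wherever an expression is expected. For example, the following subquery (a minimal sketch) evaluates to a collection containing the values 1, 2, and 3.

Example
( SELECT VALUE v FROM [1, 2, 3] AS v )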

Function Call Expressions

FunctionCallExpression ::= FunctionName "(" ( Expression ( "," Expression )* )? ")"

Functions are included in the query language, like most languages, as a way to package useful functionality or to componentize complicated or reusable computations. A function call is a legal query expression that represents the value resulting from the evaluation of its body expression with the given parameter bindings; the parameter value bindings can themselves be any expressions in the query language.

The following example is a (built-in) function call expression whose value is 8.

Example
length('a string')

Case Expressions

CaseExpression ::= SimpleCaseExpression | SearchedCaseExpression
SimpleCaseExpression ::= <CASE> Expression ( <WHEN> Expression <THEN> Expression )+ ( <ELSE> Expression )? <END>
SearchedCaseExpression ::= <CASE> ( <WHEN> Expression <THEN> Expression )+ ( <ELSE> Expression )? <END>

In a simple CASE expression, the query evaluator searches for the first WHEN ... THEN pair in which the WHEN expression is equal to the expression following CASE and returns the expression following THEN. If none of the WHEN ... THEN pairs meet this condition, and an ELSE branch exists, it returns the ELSE expression. Otherwise, NULL is returned.

In a searched CASE expression, the query evaluator searches from left to right until it finds a WHEN expression that is evaluated to TRUE, and then returns its corresponding THEN expression. If no condition is found to be TRUE, and an ELSE branch exists, it returns the ELSE expression. Otherwise, it returns NULL.

The following example illustrates the form of a case expression.

Example
CASE (2 < 3) WHEN true THEN "yes" ELSE "no" END
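
The same result can be obtained with a searched CASE expression, which omits the expression immediately after the CASE keyword (a minimal equivalent sketch).

Example
CASE WHEN 2 < 3 THEN "yes" ELSE "no" END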

Constructors

Constructor              ::= ArrayConstructor | MultisetConstructor | ObjectConstructor
ArrayConstructor         ::= "[" ( Expression ( "," Expression )* )? "]"
MultisetConstructor      ::= "{{" ( Expression ( "," Expression )* )? "}}"
ObjectConstructor        ::= "{" ( FieldBinding ( "," FieldBinding )* )? "}"
FieldBinding             ::= Expression ":" Expression

A major feature of the query language is its ability to construct new data model instances. This is accomplished using its constructors for each of the model’s complex object structures, namely arrays, multisets, and objects. Arrays are like JSON arrays, while multisets have bag semantics. Objects are built from fields that are field-name/field-value pairs, again like JSON.

The following examples illustrate how to construct new arrays with 4 items each and a new object with 2 fields. Array elements can be homogeneous (as in the first example), which is the common case, or they may be heterogeneous (as in the second example). The data values and field name values used to construct arrays, multisets, and objects in constructors are all simply query expressions. Thus, the collection elements, field names, and field values used in constructors can be simple literals or they can come from query variable references or even arbitrarily complex query expressions (subqueries). Type errors will be raised if the field names in an object are not strings, and duplicate field errors will be raised if they are not distinct.

Examples
[ 'a', 'b', 'c', 'c' ]

[ 42, "forty-two!", { "rank" : "Captain", "name": "America" }, 3.14159 ]

{
  'project name': 'Hyracks',
  'project members': [ 'vinayakb', 'dtabass', 'chenli', 'tsotras', 'tillw' ]
}
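
A multiset can be constructed analogously using double braces. The following example (a minimal sketch) constructs a multiset with four items; unlike an array, the multiset has bag semantics, as noted above.

Example
{{ 'a', 'b', 'c', 'c' }}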

3. Queries

A query can be any legal expression or SELECT statement. A query always ends with a semicolon.

Query ::= (Expression | SelectStatement) ";"

Declarations

DatabaseDeclaration ::= "USE" Identifier

At the uppermost level, the world of data is organized into data namespaces called dataverses. To set the default dataverse for statements, the USE statement is provided.

As an example, the following statement sets the default dataverse to be “TinySocial”.

Example
USE TinySocial;

When writing a complex query, it can sometimes be helpful to define one or more auxiliary functions that each address a sub-piece of the overall query. The declare function statement supports the creation of such helper functions. In general, the function body (expression) can be any legal query expression.

FunctionDeclaration  ::= "DECLARE" "FUNCTION" Identifier ParameterList "{" Expression "}"
ParameterList        ::= "(" ( <VARIABLE> ( "," <VARIABLE> )* )? ")"

The following is a simple example of a temporary function definition and its use.

Example
DECLARE FUNCTION friendInfo(userId) {
    (SELECT u.id, u.name, len(u.friendIds) AS friendCount
     FROM GleambookUsers u
     WHERE u.id = userId)[0]
 };

SELECT VALUE friendInfo(2);

For our sample data set, this returns:

[
  { "id": 2, "name": "IsbelDull", "friendCount": 2 }
]

SELECT Statements

The following shows the (rich) grammar for the SELECT statement in the query language.

SelectStatement    ::= ( WithClause )?
                       SelectSetOperation (OrderbyClause )? ( LimitClause )?
SelectSetOperation ::= SelectBlock (<UNION> <ALL> ( SelectBlock | Subquery ) )*
Subquery           ::= "(" SelectStatement ")"

SelectBlock        ::= SelectClause
                       ( FromClause ( LetClause )?)?
                       ( WhereClause )?
                       ( GroupbyClause ( LetClause )? ( HavingClause )? )?
                       |
                       FromClause ( LetClause )?
                       ( WhereClause )?
                       ( GroupbyClause ( LetClause )? ( HavingClause )? )?
                       SelectClause

SelectClause       ::= <SELECT> ( <ALL> | <DISTINCT> )? ( SelectRegular | SelectValue )
SelectRegular      ::= Projection ( "," Projection )*
SelectValue        ::= ( <VALUE> | <ELEMENT> | <RAW> ) Expression
Projection         ::= ( Expression ( <AS> )? Identifier | "*" | Identifier "." "*" )

FromClause         ::= <FROM> FromTerm ( "," FromTerm )*
FromTerm           ::= Expression (( <AS> )? Variable)?
                       ( ( JoinType )? ( JoinClause | UnnestClause ) )*

JoinClause         ::= <JOIN> Expression (( <AS> )? Variable)? <ON> Expression
UnnestClause       ::= ( <UNNEST> ) Expression
                       ( <AS> )? Variable ( <AT> Variable )?
JoinType           ::= ( <INNER> | <LEFT> ( <OUTER> )? )

WithClause         ::= <WITH> WithElement ( "," WithElement )*
LetClause          ::= (<LET> | <LETTING>) LetElement ( "," LetElement )*
LetElement         ::= Variable "=" Expression
WithElement        ::= Variable <AS> Expression

WhereClause        ::= <WHERE> Expression

GroupbyClause      ::= <GROUP> <BY> Expression ( ( (<AS>)? Variable )?
                       ( "," Expression ( (<AS>)? Variable )? )* )
                       ( <GROUP> <AS> Variable
                         ("(" VariableReference <AS> Identifier
                         ("," VariableReference <AS> Identifier )* ")")?
                       )?
HavingClause       ::= <HAVING> Expression

OrderbyClause      ::= <ORDER> <BY> Expression ( <ASC> | <DESC> )?
                       ( "," Expression ( <ASC> | <DESC> )? )*
LimitClause        ::= <LIMIT> Expression ( <OFFSET> Expression )?

In this section, we will make use of two stored collections of objects (datasets), GleambookUsers and GleambookMessages, in a series of running examples to explain SELECT queries. The contents of the example collections are as follows:

GleambookUsers collection (or, dataset):

[ {
  "id":1,
  "alias":"Margarita",
  "name":"MargaritaStoddard",
  "nickname":"Mags",
  "userSince":"2012-08-20T10:10:00",
  "friendIds":[2,3,6,10],
  "employment":[{
                  "organizationName":"Codetechno",
                  "start-date":"2006-08-06"
                },
                {
                  "organizationName":"geomedia",
                  "start-date":"2010-06-17",
                  "end-date":"2010-01-26"
                }],
  "gender":"F"
},
{
  "id":2,
  "alias":"Isbel",
  "name":"IsbelDull",
  "nickname":"Izzy",
  "userSince":"2011-01-22T10:10:00",
  "friendIds":[1,4],
  "employment":[{
                  "organizationName":"Hexviafind",
                  "startDate":"2010-04-27"
               }]
},
{
  "id":3,
  "alias":"Emory",
  "name":"EmoryUnk",
  "userSince":"2012-07-10T10:10:00",
  "friendIds":[1,5,8,9],
  "employment":[{
                  "organizationName":"geomedia",
                  "startDate":"2010-06-17",
                  "endDate":"2010-01-26"
               }]
} ]

GleambookMessages collection (or, dataset):

[ {
  "messageId":2,
  "authorId":1,
  "inResponseTo":4,
  "senderLocation":[41.66,80.87],
  "message":" dislike x-phone its touch-screen is horrible"
},
{
  "messageId":3,
  "authorId":2,
  "inResponseTo":4,
  "senderLocation":[48.09,81.01],
  "message":" like product-y the plan is amazing"
},
{
  "messageId":4,
  "authorId":1,
  "inResponseTo":2,
  "senderLocation":[37.73,97.04],
  "message":" can't stand acast the network is horrible:("
},
{
  "messageId":6,
  "authorId":2,
  "inResponseTo":1,
  "senderLocation":[31.5,75.56],
  "message":" like product-z its platform is mind-blowing"
},
{
  "messageId":8,
  "authorId":1,
  "inResponseTo":11,
  "senderLocation":[40.33,80.87],
  "message":" like ccast the 3G is awesome:)"
},
{
  "messageId":10,
  "authorId":1,
  "inResponseTo":12,
  "senderLocation":[42.5,70.01],
  "message":" can't stand product-w the touch-screen is terrible"
},
{
  "messageId":11,
  "authorId":1,
  "inResponseTo":1,
  "senderLocation":[38.97,77.49],
  "message":" can't stand acast its plan is terrible"
} ]

SELECT Clause

The SELECT clause always returns a collection value as its result (even if the result is empty or a singleton).

Select Element/Value/Raw

The SELECT VALUE clause returns an array or multiset that contains the results of evaluating the VALUE expression, with one evaluation being performed per “binding tuple” (i.e., per FROM clause item) satisfying the statement’s selection criteria. For historical reasons the query language also allows the keywords ELEMENT or RAW to be used in place of VALUE (not recommended).

If there is no FROM clause, the expression after VALUE is evaluated once with no binding tuples (except those inherited from an outer environment).

Example
SELECT VALUE 1;

This query returns:

[
  1
]

The following example shows a query that selects one user from the GleambookUsers collection.

Example
SELECT VALUE user
FROM GleambookUsers user
WHERE user.id = 1;

This query returns:

[{
    "userSince": "2012-08-20T10:10:00.000Z",
    "friendIds": [
        2,
        3,
        6,
        10
    ],
    "gender": "F",
    "name": "MargaritaStoddard",
    "nickname": "Mags",
    "alias": "Margarita",
    "id": 1,
    "employment": [
        {
            "organizationName": "Codetechno",
            "start-date": "2006-08-06"
        },
        {
            "end-date": "2010-01-26",
            "organizationName": "geomedia",
            "start-date": "2010-06-17"
        }
    ]
} ]

SQL-style SELECT

The traditional SQL-style SELECT syntax is also supported in the query language. This syntax can also be reformulated in a SELECT VALUE based manner. (E.g., SELECT expA AS fldA, expB AS fldB is syntactic sugar for SELECT VALUE { 'fldA': expA, 'fldB': expB }.) Unlike in SQL, the result of a query does not preserve the order of expressions in the SELECT clause.

Example
SELECT user.alias user_alias, user.name user_name
FROM GleambookUsers user
WHERE user.id = 1;

Returns:

[ {
    "user_name": "MargaritaStoddard",
    "user_alias": "Margarita"
} ]
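
Per the rewriting described above, the same query can also be expressed in SELECT VALUE form as follows (an equivalent formulation, shown for illustration).

Example
SELECT VALUE { 'user_alias': user.alias, 'user_name': user.name }
FROM GleambookUsers user
WHERE user.id = 1;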

SELECT *

SELECT * returns an object with a nested field for each input tuple. Each field has as its field name the name of a binding variable generated by either the FROM clause or GROUP BY clause in the current enclosing SELECT statement, and its field value is the value of that binding variable.

Note that the result of SELECT * is different from the result of a query that selects all the fields of an object.

Example
SELECT *
FROM GleambookUsers user;

Since user is the only binding variable generated in the FROM clause, this query returns:

[ {
    "user": {
        "userSince": "2012-08-20T10:10:00.000Z",
        "friendIds": [
            2,
            3,
            6,
            10
        ],
        "gender": "F",
        "name": "MargaritaStoddard",
        "nickname": "Mags",
        "alias": "Margarita",
        "id": 1,
        "employment": [
            {
                "organizationName": "Codetechno",
                "start-date": "2006-08-06"
            },
            {
                "end-date": "2010-01-26",
                "organizationName": "geomedia",
                "start-date": "2010-06-17"
            }
        ]
    }
}, {
    "user": {
        "userSince": "2011-01-22T10:10:00.000Z",
        "friendIds": [
            1,
            4
        ],
        "name": "IsbelDull",
        "nickname": "Izzy",
        "alias": "Isbel",
        "id": 2,
        "employment": [
            {
                "organizationName": "Hexviafind",
                "startDate": "2010-04-27"
            }
        ]
    }
}, {
    "user": {
        "userSince": "2012-07-10T10:10:00.000Z",
        "friendIds": [
            1,
            5,
            8,
            9
        ],
        "name": "EmoryUnk",
        "alias": "Emory",
        "id": 3,
        "employment": [
            {
                "organizationName": "geomedia",
                "endDate": "2010-01-26",
                "startDate": "2010-06-17"
            }
        ]
    }
} ]
Example
SELECT *
FROM GleambookUsers u, GleambookMessages m
WHERE m.authorId = u.id and u.id = 2;

This query performs an inner join, which we discuss further under Multiple FROM Terms below. Since both u and m are binding variables generated in the FROM clause, this query returns:

[ {
    "u": {
        "userSince": "2011-01-22T10:10:00",
        "friendIds": [
            1,
            4
        ],
        "name": "IsbelDull",
        "nickname": "Izzy",
        "alias": "Isbel",
        "id": 2,
        "employment": [
            {
                "organizationName": "Hexviafind",
                "startDate": "2010-04-27"
            }
        ]
    },
    "m": {
        "senderLocation": [
            31.5,
            75.56
        ],
        "inResponseTo": 1,
        "messageId": 6,
        "authorId": 2,
        "message": " like product-z its platform is mind-blowing"
    }
}, {
    "u": {
        "userSince": "2011-01-22T10:10:00",
        "friendIds": [
            1,
            4
        ],
        "name": "IsbelDull",
        "nickname": "Izzy",
        "alias": "Isbel",
        "id": 2,
        "employment": [
            {
                "organizationName": "Hexviafind",
                "startDate": "2010-04-27"
            }
        ]
    },
    "m": {
        "senderLocation": [
            48.09,
            81.01
        ],
        "inResponseTo": 4,
        "messageId": 3,
        "authorId": 2,
        "message": " like product-y the plan is amazing"
    }
} ]

SELECT variable.*

Whereas SELECT * returns all the fields bound to all the variables which are currently defined, the notation SELECT c.* returns all the fields of the object bound to variable c. The variable c must be bound to an object for this to work.

Example
SELECT user.*
FROM GleambookUsers user;

Compare this query with the first example given under SELECT *. This query returns all users from the GleambookUsers dataset, but the user variable name is omitted from the results:

[
  {
    "id": 1,
    "alias": "Margarita",
    "name": "MargaritaStoddard",
    "nickname": "Mags",
    "userSince": "2012-08-20T10:10:00",
    "friendIds": [
      2,
      3,
      6,
      10
    ],
    "employment": [
      {
        "organizationName": "Codetechno",
        "start-date": "2006-08-06"
      },
      {
        "organizationName": "geomedia",
        "start-date": "2010-06-17",
        "end-date": "2010-01-26"
      }
    ],
    "gender": "F"
  },
  {
    "id": 2,
    "alias": "Isbel",
    "name": "IsbelDull",
    "nickname": "Izzy",
    "userSince": "2011-01-22T10:10:00",
    "friendIds": [
      1,
      4
    ],
    "employment": [
      {
        "organizationName": "Hexviafind",
        "startDate": "2010-04-27"
      }
    ]
  },
  {
    "id": 3,
    "alias": "Emory",
    "name": "EmoryUnk",
    "userSince": "2012-07-10T10:10:00",
    "friendIds": [
      1,
      5,
      8,
      9
    ],
    "employment": [
      {
        "organizationName": "geomedia",
        "startDate": "2010-06-17",
        "endDate": "2010-01-26"
      }
    ]
  }
]

SELECT DISTINCT

The DISTINCT keyword is used to eliminate duplicate items in results. The following example shows how it works.

Example
SELECT DISTINCT * FROM [1, 2, 2, 3] AS foo;

This query returns:

[ {
    "foo": 1
}, {
    "foo": 2
}, {
    "foo": 3
} ]
Example
SELECT DISTINCT VALUE foo FROM [1, 2, 2, 3] AS foo;

This version of the query returns:

[ 1
, 2
, 3
 ]

Unnamed Projections

Similar to standard SQL, the query language supports unnamed projections (a.k.a. unnamed SELECT clause items), for which names are generated. Name generation has three cases:

  • If a projection expression is a variable reference expression, its generated name is the name of the variable.
  • If a projection expression is a field access expression, its generated name is the last identifier in the expression.
  • For all other cases, the query processor will generate a unique name.
Example
SELECT substr(user.name, 10), user.alias
FROM GleambookUsers user
WHERE user.id = 1;

This query outputs:

[ {
    "alias": "Margarita",
    "$1": "Stoddard"
} ]

In the result, $1 is the generated name for substr(user.name, 10), while alias is the generated name for user.alias.

Abbreviated Field Access Expressions

As in standard SQL, field access expressions can be abbreviated (not recommended!) when there is no ambiguity. In the next example, the variable user is the only possible variable reference for the fields id, name and alias, and thus it could be omitted in the query. More information on abbreviated field access can be found in the appendix section on Variable Resolution.

Example
SELECT substr(name, 10) AS lname, alias
FROM GleambookUsers user
WHERE id = 1;

Outputs:

[ {
    "lname": "Stoddard",
    "alias": "Margarita"
} ]

UNNEST Clause

For each of its input tuples, the UNNEST clause flattens a collection-valued expression into individual items, producing multiple tuples, each of which is one of the expression’s original input tuples augmented with a flattened item from its collection.

Inner UNNEST

The following example is a query that retrieves the names of the organizations that a selected user has worked for. It uses the UNNEST clause to unnest the nested collection employment in the user’s object.

Example
SELECT u.id AS userId, e.organizationName AS orgName
FROM GleambookUsers u
UNNEST u.employment e
WHERE u.id = 1;

This query returns:

[ {
    "orgName": "Codetechno",
    "userId": 1
}, {
    "orgName": "geomedia",
    "userId": 1
} ]

Note that UNNEST has SQL’s inner join semantics — that is, if a user has no employment history, no tuple corresponding to that user will be emitted in the result.

Left Outer UNNEST

As an alternative, the LEFT OUTER UNNEST clause offers SQL’s left outer join semantics. For example, no collection-valued field named hobbies exists in the object for the user whose id is 1, but the following query’s result still includes user 1.

Example
SELECT u.id AS userId, h.hobbyName AS hobby
FROM GleambookUsers u
LEFT OUTER UNNEST u.hobbies h
WHERE u.id = 1;

Returns:

[ {
    "userId": 1
} ]

Note that if u.hobbies is an empty collection or leads to a MISSING (as above) or NULL value for a given input tuple, there is no corresponding binding value for the variable h for that input tuple. A MISSING value will be generated for h so that the input tuple can still be propagated.

Expressing Joins Using UNNEST

The UNNEST clause is similar to SQL’s JOIN clause except that it allows its right argument to be correlated to its left argument, as in the examples above — i.e., think “correlated cross-product”. The next example shows this via a query that joins two data sets, GleambookUsers and GleambookMessages, returning user/message pairs. The results contain one object per pair, with result objects containing the user’s name and an entire message. The query can be thought of as saying “for each Gleambook user, unnest the GleambookMessages collection and filter the output with the condition message.authorId = user.id”.

Example
SELECT u.name AS uname, m.message AS message
FROM GleambookUsers u
UNNEST GleambookMessages m
WHERE m.authorId = u.id;

This returns:

[ {
    "uname": "MargaritaStoddard",
    "message": " can't stand acast its plan is terrible"
}, {
    "uname": "MargaritaStoddard",
    "message": " dislike x-phone its touch-screen is horrible"
}, {
    "uname": "MargaritaStoddard",
    "message": " can't stand acast the network is horrible:("
}, {
    "uname": "MargaritaStoddard",
    "message": " like ccast the 3G is awesome:)"
}, {
    "uname": "MargaritaStoddard",
    "message": " can't stand product-w the touch-screen is terrible"
}, {
    "uname": "IsbelDull",
    "message": " like product-z its platform is mind-blowing"
}, {
    "uname": "IsbelDull",
    "message": " like product-y the plan is amazing"
} ]

Similarly, the above query can also be expressed as the UNNESTing of a correlated subquery:

Example
SELECT u.name AS uname, m.message AS message
FROM GleambookUsers u
UNNEST (
    SELECT VALUE msg
    FROM GleambookMessages msg
    WHERE msg.authorId = u.id
) AS m;

FROM clauses

A FROM clause is used for enumerating (i.e., conceptually iterating over) the contents of collections, as in SQL.

Binding expressions

In addition to stored collections, a FROM clause can iterate over any intermediate collection returned by a valid query expression. In the tuple stream generated by a FROM clause, the ordering of the input tuples is not guaranteed to be preserved.

Example
SELECT VALUE foo
FROM [1, 2, 2, 3] AS foo
WHERE foo > 2;

Returns:

[
  3
]

Multiple FROM Terms

The query language permits correlations among FROM terms. Specifically, a FROM binding expression can refer to variables defined to its left in the given FROM clause. Thus, the first unnesting example above could also be expressed as follows:

Example
SELECT u.id AS userId, e.organizationName AS orgName
FROM GleambookUsers u, u.employment e
WHERE u.id = 1;

Expressing Joins Using FROM Terms

Similarly, the join intentions of the other UNNEST-based join examples above could be expressed as:

Example
SELECT u.name AS uname, m.message AS message
FROM GleambookUsers u, GleambookMessages m
WHERE m.authorId = u.id;
Example
SELECT u.name AS uname, m.message AS message
FROM GleambookUsers u,
  (
    SELECT VALUE msg
    FROM GleambookMessages msg
    WHERE msg.authorId = u.id
  ) AS m;

Note that the first alternative is one of the SQL-92 approaches to expressing a join.

Implicit Binding Variables

Similar to standard SQL, the query language supports implicit FROM binding variables (i.e., aliases), for which a binding variable is generated. Variable generation falls into three cases:

  • If the binding expression is a variable reference expression, the generated variable’s name will be the name of the referenced variable itself.
  • If the binding expression is a field access expression (or a fully qualified name for a dataset), the generated variable’s name will be the last identifier (or the dataset name) in the expression.
  • For all other cases, a compilation error will be raised.

The next two examples show queries that do not provide binding variables in their FROM clauses.

Example
SELECT GleambookUsers.name, GleambookMessages.message
FROM GleambookUsers, GleambookMessages
WHERE GleambookMessages.authorId = GleambookUsers.id;

Returns:

[ {
    "name": "MargaritaStoddard",
    "message": " like ccast the 3G is awesome:)"
}, {
    "name": "MargaritaStoddard",
    "message": " can't stand product-w the touch-screen is terrible"
}, {
    "name": "MargaritaStoddard",
    "message": " can't stand acast its plan is terrible"
}, {
    "name": "MargaritaStoddard",
    "message": " dislike x-phone its touch-screen is horrible"
}, {
    "name": "MargaritaStoddard",
    "message": " can't stand acast the network is horrible:("
}, {
    "name": "IsbelDull",
    "message": " like product-y the plan is amazing"
}, {
    "name": "IsbelDull",
    "message": " like product-z its platform is mind-blowing"
} ]
Example
SELECT GleambookUsers.name, GleambookMessages.message
FROM GleambookUsers,
  (
    SELECT VALUE GleambookMessages
    FROM GleambookMessages
    WHERE GleambookMessages.authorId = GleambookUsers.id
  );

Returns:

Error: "Syntax error: Need an alias for the enclosed expression:\n(select element GleambookMessages\n    from GleambookMessages as GleambookMessages\n    where (GleambookMessages.authorId = GleambookUsers.id)\n )",
    "query_from_user": "use TinySocial;\n\nSELECT GleambookUsers.name, GleambookMessages.message\n    FROM GleambookUsers,\n      (\n        SELECT VALUE GleambookMessages\n        FROM GleambookMessages\n        WHERE GleambookMessages.authorId = GleambookUsers.id\n      );"

More information on implicit binding variables can be found in the appendix section on Variable Resolution.

JOIN Clauses

The join clause in the query language supports both inner joins and left outer joins from standard SQL.

Inner joins

Using a JOIN clause, the inner join intent from the preceding examples can also be expressed as follows:

Example
SELECT u.name AS uname, m.message AS message
FROM GleambookUsers u JOIN GleambookMessages m ON m.authorId = u.id;

Left Outer Joins

The query language supports SQL’s notion of left outer join. The following query is an example:

SELECT u.name AS uname, m.message AS message
FROM GleambookUsers u LEFT OUTER JOIN GleambookMessages m ON m.authorId = u.id;

Returns:

[ {
    "uname": "MargaritaStoddard",
    "message": " like ccast the 3G is awesome:)"
}, {
    "uname": "MargaritaStoddard",
    "message": " can't stand product-w the touch-screen is terrible"
}, {
    "uname": "MargaritaStoddard",
    "message": " can't stand acast its plan is terrible"
}, {
    "uname": "MargaritaStoddard",
    "message": " dislike x-phone its touch-screen is horrible"
}, {
    "uname": "MargaritaStoddard",
    "message": " can't stand acast the network is horrible:("
}, {
    "uname": "IsbelDull",
    "message": " like product-y the plan is amazing"
}, {
    "uname": "IsbelDull",
    "message": " like product-z its platform is mind-blowing"
}, {
    "uname": "EmoryUnk"
} ]

For non-matching left-side tuples, the query language produces MISSING values for the right-side binding variables; that is why the last object in the above result doesn’t have a message field. Note that this is slightly different from standard SQL, which instead would fill in NULL values for the right-side fields. The reason for this difference is that, for non-matches in its join results, the query language views fields from the right-side as being “not there” (a.k.a. MISSING) instead of as being “there but unknown” (i.e., NULL).

The left-outer join query can also be expressed using LEFT OUTER UNNEST:

SELECT u.name AS uname, m.message AS message
FROM GleambookUsers u
LEFT OUTER UNNEST (
    SELECT VALUE message
    FROM GleambookMessages message
    WHERE message.authorId = u.id
  ) m;

In general, SQL-style join queries can also be expressed by UNNEST clauses and left outer join queries can be expressed by LEFT OUTER UNNESTs.

Variable scope in JOIN clauses

Variables defined by JOIN subclauses are not visible to other subclauses in the same FROM clause. This also applies to the FROM variable that starts the JOIN subclause.

Example
SELECT * FROM GleambookUsers u
JOIN (SELECT VALUE m
      FROM GleambookMessages m
      WHERE m.authorId = u.id) m
ON u.id = m.authorId;

The variable u defined by the FROM clause is not visible inside the JOIN subclause, so this query returns no results.
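
One way to express the intended join is to keep the subquery uncorrelated and state the join condition only in the ON clause, as in the earlier inner join examples (a sketch of one possible reformulation).

Example
SELECT * FROM GleambookUsers u
JOIN (SELECT VALUE m
      FROM GleambookMessages m) m
ON u.id = m.authorId;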

GROUP BY Clauses

The GROUP BY clause generalizes standard SQL’s grouping and aggregation semantics, but it also retains backward compatibility with the standard (relational) SQL GROUP BY and aggregation features.

Group variables

In a GROUP BY clause, in addition to the binding variable(s) defined for the grouping key(s), the query language allows a user to define a group variable by using the clause’s GROUP AS extension to denote the resulting group. After grouping, then, the query’s in-scope variables include the grouping key’s binding variables as well as this group variable which will be bound to one collection value for each group. This per-group collection (i.e., multiset) value will be a set of nested objects in which each field of the object is the result of a renamed variable defined in parentheses following the group variable’s name. The GROUP AS syntax is as follows:

<GROUP> <AS> Variable ("(" VariableReference <AS> Identifier ("," VariableReference <AS> Identifier )* ")")?
Example
SELECT *
FROM GleambookMessages message
GROUP BY message.authorId AS uid GROUP AS msgs(message AS msg);

This first example query returns:

[ {
    "msgs": [
        {
            "msg": {
                "senderLocation": [
                    38.97,
                    77.49
                ],
                "inResponseTo": 1,
                "messageId": 11,
                "authorId": 1,
                "message": " can't stand acast its plan is terrible"
            }
        },
        {
            "msg": {
                "senderLocation": [
                    41.66,
                    80.87
                ],
                "inResponseTo": 4,
                "messageId": 2,
                "authorId": 1,
                "message": " dislike x-phone its touch-screen is horrible"
            }
        },
        {
            "msg": {
                "senderLocation": [
                    37.73,
                    97.04
                ],
                "inResponseTo": 2,
                "messageId": 4,
                "authorId": 1,
                "message": " can't stand acast the network is horrible:("
            }
        },
        {
            "msg": {
                "senderLocation": [
                    40.33,
                    80.87
                ],
                "inResponseTo": 11,
                "messageId": 8,
                "authorId": 1,
                "message": " like ccast the 3G is awesome:)"
            }
        },
        {
            "msg": {
                "senderLocation": [
                    42.5,
                    70.01
                ],
                "inResponseTo": 12,
                "messageId": 10,
                "authorId": 1,
                "message": " can't stand product-w the touch-screen is terrible"
            }
        }
    ],
    "uid": 1
}, {
    "msgs": [
        {
            "msg": {
                "senderLocation": [
                    31.5,
                    75.56
                ],
                "inResponseTo": 1,
                "messageId": 6,
                "authorId": 2,
                "message": " like product-z its platform is mind-blowing"
            }
        },
        {
            "msg": {
                "senderLocation": [
                    48.09,
                    81.01
                ],
                "inResponseTo": 4,
                "messageId": 3,
                "authorId": 2,
                "message": " like product-y the plan is amazing"
            }
        }
    ],
    "uid": 2
} ]

As we can see from the above query result, each group in the example query’s output has an associated group variable value called msgs that appears in the SELECT *’s result. This variable contains a collection of objects associated with the group; each of the group’s message values appears in the msg field of the objects in the msgs collection.

The group variable in the query language makes more complex, composable, nested subqueries over a group possible, which is important given the language’s more complex data model (relative to SQL). As a simple example of this, as we really just want the messages associated with each user, we might wish to avoid the “extra wrapping” of each message as the msg field of an object. (That wrapping is useful in more complex cases, but is essentially just in the way here.) We can use a subquery in the SELECT clause to tunnel through the extra nesting and produce the desired result.

Example
SELECT uid, (SELECT VALUE g.msg FROM g) AS msgs
FROM GleambookMessages gbm
GROUP BY gbm.authorId AS uid
GROUP AS g(gbm as msg);

This variant of the example query returns:

   [ {
       "msgs": [
           {
               "senderLocation": [
                   38.97,
                   77.49
               ],
               "inResponseTo": 1,
               "messageId": 11,
               "authorId": 1,
               "message": " can't stand acast its plan is terrible"
           },
           {
               "senderLocation": [
                   41.66,
                   80.87
               ],
               "inResponseTo": 4,
               "messageId": 2,
               "authorId": 1,
               "message": " dislike x-phone its touch-screen is horrible"
           },
           {
               "senderLocation": [
                   37.73,
                   97.04
               ],
               "inResponseTo": 2,
               "messageId": 4,
               "authorId": 1,
               "message": " can't stand acast the network is horrible:("
           },
           {
               "senderLocation": [
                   40.33,
                   80.87
               ],
               "inResponseTo": 11,
               "messageId": 8,
               "authorId": 1,
               "message": " like ccast the 3G is awesome:)"
           },
           {
               "senderLocation": [
                   42.5,
                   70.01
               ],
               "inResponseTo": 12,
               "messageId": 10,
               "authorId": 1,
               "message": " can't stand product-w the touch-screen is terrible"
           }
       ],
       "uid": 1
   }, {
       "msgs": [
           {
               "senderLocation": [
                   31.5,
                   75.56
               ],
               "inResponseTo": 1,
               "messageId": 6,
               "authorId": 2,
               "message": " like product-z its platform is mind-blowing"
           },
           {
               "senderLocation": [
                   48.09,
                   81.01
               ],
               "inResponseTo": 4,
               "messageId": 3,
               "authorId": 2,
               "message": " like product-y the plan is amazing"
           }
       ],
       "uid": 2
   } ]

The next example shows a more interesting case involving the use of a subquery in the SELECT list. Here the subquery further processes the groups. There is no renaming in the declaration of the group variable g, so g has only one field, gbm, which comes from the FROM clause.

Example
SELECT uid,
       (SELECT VALUE g.gbm
        FROM g
        WHERE g.gbm.message LIKE '% like%'
        ORDER BY g.gbm.messageId
        LIMIT 2) AS msgs
FROM GleambookMessages gbm
GROUP BY gbm.authorId AS uid
GROUP AS g;

This example query returns:

[ {
    "msgs": [
        {
            "senderLocation": [
                40.33,
                80.87
            ],
            "inResponseTo": 11,
            "messageId": 8,
            "authorId": 1,
            "message": " like ccast the 3G is awesome:)"
        }
    ],
    "uid": 1
}, {
    "msgs": [
        {
            "senderLocation": [
                48.09,
                81.01
            ],
            "inResponseTo": 4,
            "messageId": 3,
            "authorId": 2,
            "message": " like product-y the plan is amazing"
        },
        {
            "senderLocation": [
                31.5,
                75.56
            ],
            "inResponseTo": 1,
            "messageId": 6,
            "authorId": 2,
            "message": " like product-z its platform is mind-blowing"
        }
    ],
    "uid": 2
} ]

Implicit Grouping Key Variables

In the query language syntax, providing named binding variables for GROUP BY key expressions is optional. If a grouping key is missing a user-provided binding variable, the underlying compiler will generate one. Automatic grouping key variable naming falls into three cases, much like the treatment of unnamed projections:

  • If the grouping key expression is a variable reference expression, the generated variable gets the same name as the referred variable;
  • If the grouping key expression is a field access expression, the generated variable gets the same name as the last identifier in the expression;
  • For all other cases, the compiler generates a unique variable (but the user query is unable to refer to this generated variable).

The next example illustrates a query that doesn’t provide binding variables for its grouping key expressions.

Example
SELECT authorId,
       (SELECT VALUE g.gbm
        FROM g
        WHERE g.gbm.message LIKE '% like%'
        ORDER BY g.gbm.messageId
        LIMIT 2) AS msgs
FROM GleambookMessages gbm
GROUP BY gbm.authorId
GROUP AS g;

This query returns:

    [ {
    "msgs": [
        {
            "senderLocation": [
                40.33,
                80.87
            ],
            "inResponseTo": 11,
            "messageId": 8,
            "authorId": 1,
            "message": " like ccast the 3G is awesome:)"
        }
    ],
    "authorId": 1
}, {
    "msgs": [
        {
            "senderLocation": [
                48.09,
                81.01
            ],
            "inResponseTo": 4,
            "messageId": 3,
            "authorId": 2,
            "message": " like product-y the plan is amazing"
        },
        {
            "senderLocation": [
                31.5,
                75.56
            ],
            "inResponseTo": 1,
            "messageId": 6,
            "authorId": 2,
            "message": " like product-z its platform is mind-blowing"
        }
    ],
    "authorId": 2
} ]

Based on the three variable generation rules, the generated variable for the grouping key expression gbm.authorId is authorId (which is how it is referred to in the example’s SELECT clause).

Implicit Group Variables

The group variable itself is also optional in the GROUP BY syntax. If a user’s query does not declare the name and structure of the group variable using GROUP AS, the query compiler will generate a unique group variable whose fields include all of the binding variables defined in the FROM clause of the current enclosing SELECT statement. In this case the user’s query will not be able to refer to the generated group variable, but it can still call SQL-92 aggregation functions, as in standard SQL.

Aggregation Functions

In traditional SQL, which doesn’t support nested data, grouping always involves the use of aggregation to compute properties of the groups (for example, the average number of messages per user rather than the actual set of messages per user). Each aggregation function in the query language takes a collection (for example, the group of messages) as its input and produces a scalar value as its output. These aggregation functions, being truly functional in nature (unlike in SQL), can be used anywhere in a query where an expression is allowed. The following table catalogs the built-in aggregation functions of the query language and also indicates how each one handles NULL/MISSING values in the input collection or a completely empty input collection:

Function NULL MISSING Empty Collection
STRICT_COUNT counted counted 0
STRICT_SUM returns NULL returns NULL returns NULL
STRICT_MAX returns NULL returns NULL returns NULL
STRICT_MIN returns NULL returns NULL returns NULL
STRICT_AVG returns NULL returns NULL returns NULL
ARRAY_COUNT not counted not counted 0
ARRAY_SUM ignores NULL ignores NULL returns NULL
ARRAY_MAX ignores NULL ignores NULL returns NULL
ARRAY_MIN ignores NULL ignores NULL returns NULL
ARRAY_AVG ignores NULL ignores NULL returns NULL

Notice that the query language has twice as many functions listed above as there are aggregate functions in SQL-92. This is because the language offers two versions of each – one that handles UNKNOWN values in a semantically strict fashion, where unknown values in the input result in unknown values in the output – and one that handles them in the ad hoc “just ignore the unknown values” fashion that the SQL standard chose to adopt.
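
To see the difference concretely, the two flavors can be applied to a small constant collection containing a NULL (a minimal sketch, not drawn from the sample datasets):

[ STRICT_AVG([1, 2, null]), ARRAY_AVG([1, 2, null]) ];

Per the table above, the strict version should return null for this input, while the non-strict version should ignore the NULL and return 1.5, giving [ null, 1.5 ].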

Example
ARRAY_AVG(
    (
      SELECT VALUE ARRAY_COUNT(friendIds) FROM GleambookUsers
    )
);

This example returns:

3.3333333333333335
Example
SELECT uid AS uid, ARRAY_COUNT(grp) AS msgCnt
FROM GleambookMessages message
GROUP BY message.authorId AS uid
GROUP AS grp(message AS msg);

This query returns:

[ {
    "uid": 1,
    "msgCnt": 5
}, {
    "uid": 2,
    "msgCnt": 2
} ]

Notice how the query forms groups where each group involves a message author and their messages. (SQL cannot do this because the grouped intermediate result is non-1NF in nature.) The query then uses the collection aggregate function ARRAY_COUNT to get the cardinality of each group of messages.

Each aggregation function in the query language supports a DISTINCT modifier that removes duplicate values from the input collection before aggregating.

Example
ARRAY_SUM(DISTINCT [1, 1, 2, 2, 3])

This query returns:

6

SQL-92 Aggregation Functions

For compatibility with the traditional SQL aggregation functions, the query language also offers SQL-92’s aggregation function symbols (COUNT, SUM, MAX, MIN, and AVG) as supported syntactic sugar. The query compiler rewrites queries that utilize these function symbols into queries that only use the collection aggregate functions of the query language. The following example uses the SQL-92 syntax approach to compute a result that is identical to that of the more explicit example above:

Example
SELECT uid, COUNT(*) AS msgCnt
FROM GleambookMessages msg
GROUP BY msg.authorId AS uid;

It is important to realize that COUNT is actually not a built-in aggregation function. Rather, the COUNT query above is using a special “sugared” function symbol that the query compiler will rewrite as follows:

SELECT uid AS uid, ARRAY_COUNT( (SELECT VALUE 1 FROM `$1` as g) ) AS msgCnt
FROM GleambookMessages msg
GROUP BY msg.authorId AS uid
GROUP AS `$1`(msg AS msg);

The same sort of rewritings apply to the function symbols SUM, MAX, MIN, and AVG. In contrast to the collection aggregate functions of the query language, these special SQL-92 function symbols can only be used in the same way they are in standard SQL (i.e., with the same restrictions).

The DISTINCT modifier is also supported for these SQL-92 aggregate function symbols.
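
For example, the following sketch uses COUNT(DISTINCT ...) to count, per author, the number of distinct messages that the author’s messages respond to; it follows the same GROUP BY pattern as the COUNT(*) example above:

SELECT msg.authorId AS uid, COUNT(DISTINCT msg.inResponseTo) AS respondedTo
FROM GleambookMessages msg
GROUP BY msg.authorId;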

SQL-92 Compliant GROUP BY Aggregations

The query language provides full support for SQL-92 GROUP BY aggregation queries. The following query is such an example:

Example
SELECT msg.authorId, COUNT(*)
FROM GleambookMessages msg
GROUP BY msg.authorId;

This query outputs:

[ {
    "authorId": 1,
    "$1": 5
}, {
    "authorId": 2,
    "$1": 2
} ]

In principle, a msg reference in the query’s SELECT clause would be “sugarized” as a collection (as described in Implicit Group Variables). However, since the SELECT expression msg.authorId is syntactically identical to a GROUP BY key expression, it will be internally replaced by the generated group key variable. The following is the equivalent rewritten query that will be generated by the compiler for the query above:

SELECT authorId AS authorId, ARRAY_COUNT( (SELECT g.msg FROM `$1` AS g) )
FROM GleambookMessages msg
GROUP BY msg.authorId AS authorId
GROUP AS `$1`(msg AS msg);

Column Aliases

The query language also allows column aliases to be used as ORDER BY keys.

Example
SELECT msg.authorId AS aid, COUNT(*)
FROM GleambookMessages msg
GROUP BY msg.authorId
ORDER BY aid;

This query returns:

[ {
    "$1": 5,
    "aid": 1
}, {
    "$1": 2,
    "aid": 2
} ]

WHERE Clauses and HAVING Clauses

Both WHERE clauses and HAVING clauses are used to filter input data based on a condition expression. Only tuples for which the condition expression evaluates to TRUE are propagated. Note that if the condition expression evaluates to NULL or MISSING, the input tuple will be discarded.
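
As an illustration, the following sketch uses a HAVING clause to keep only the groups (authors) whose message count exceeds two, in the same way that a WHERE clause filters individual input tuples; for the sample data shown earlier, only the author with five messages should survive the filter:

SELECT msg.authorId, COUNT(*) AS msgCnt
FROM GleambookMessages msg
GROUP BY msg.authorId
HAVING COUNT(*) > 2;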

ORDER BY Clauses

The ORDER BY clause is used to globally sort data in either ascending order (i.e., ASC) or descending order (i.e., DESC). During ordering, MISSING and NULL are treated as being smaller than any other value if they are encountered in the ordering key(s). MISSING is treated as smaller than NULL if both occur in the data being sorted. The ordering of values of a given type is consistent with its type’s <= ordering; the ordering of values across types is implementation-defined but stable. The following example returns all GleambookUsers in descending order by their number of friends.

Example
  SELECT VALUE user
  FROM GleambookUsers AS user
  ORDER BY ARRAY_COUNT(user.friendIds) DESC;

This query returns:

  [ {
      "userSince": "2012-08-20T10:10:00.000Z",
      "friendIds": [
          2,
          3,
          6,
          10
      ],
      "gender": "F",
      "name": "MargaritaStoddard",
      "nickname": "Mags",
      "alias": "Margarita",
      "id": 1,
      "employment": [
          {
              "organizationName": "Codetechno",
              "start-date": "2006-08-06"
          },
          {
              "end-date": "2010-01-26",
              "organizationName": "geomedia",
              "start-date": "2010-06-17"
          }
      ]
  }, {
      "userSince": "2012-07-10T10:10:00.000Z",
      "friendIds": [
          1,
          5,
          8,
          9
      ],
      "name": "EmoryUnk",
      "alias": "Emory",
      "id": 3,
      "employment": [
          {
              "organizationName": "geomedia",
              "endDate": "2010-01-26",
              "startDate": "2010-06-17"
          }
      ]
  }, {
      "userSince": "2011-01-22T10:10:00.000Z",
      "friendIds": [
          1,
          4
      ],
      "name": "IsbelDull",
      "nickname": "Izzy",
      "alias": "Isbel",
      "id": 2,
      "employment": [
          {
              "organizationName": "Hexviafind",
              "startDate": "2010-04-27"
          }
      ]
  } ]
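
Because MISSING and NULL order before all other values, sorting on an optional field moves objects that lack the field to the front of an ascending result. For instance, the following sketch orders users by their optional nickname field; in the sample data, the user without a nickname should appear first:

  SELECT user.name AS name, user.nickname AS nickname
  FROM GleambookUsers AS user
  ORDER BY user.nickname;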

LIMIT Clauses

The LIMIT clause is used to limit the result set to a specified constant size. The use of the LIMIT clause is illustrated in the next example.

Example
  SELECT VALUE user
  FROM GleambookUsers AS user
  ORDER BY len(user.friendIds) DESC
  LIMIT 1;

This query returns:

  [ {
      "userSince": "2012-08-20T10:10:00.000Z",
      "friendIds": [
          2,
          3,
          6,
          10
      ],
      "gender": "F",
      "name": "MargaritaStoddard",
      "nickname": "Mags",
      "alias": "Margarita",
      "id": 1,
      "employment": [
          {
              "organizationName": "Codetechno",
              "start-date": "2006-08-06"
          },
          {
              "end-date": "2010-01-26",
              "organizationName": "geomedia",
              "start-date": "2010-06-17"
          }
      ]
  } ]

WITH Clauses

As in standard SQL, WITH clauses are available to improve the modularity of a query. The next query shows an example.

Example
WITH avgFriendCount AS (
  SELECT VALUE AVG(ARRAY_COUNT(user.friendIds))
  FROM GleambookUsers AS user
)[0]
SELECT VALUE user
FROM GleambookUsers user
WHERE ARRAY_COUNT(user.friendIds) > avgFriendCount;

This query returns:

[ {
    "userSince": "2012-08-20T10:10:00.000Z",
    "friendIds": [
        2,
        3,
        6,
        10
    ],
    "gender": "F",
    "name": "MargaritaStoddard",
    "nickname": "Mags",
    "alias": "Margarita",
    "id": 1,
    "employment": [
        {
            "organizationName": "Codetechno",
            "start-date": "2006-08-06"
        },
        {
            "end-date": "2010-01-26",
            "organizationName": "geomedia",
            "start-date": "2010-06-17"
        }
    ]
}, {
    "userSince": "2012-07-10T10:10:00.000Z",
    "friendIds": [
        1,
        5,
        8,
        9
    ],
    "name": "EmoryUnk",
    "alias": "Emory",
    "id": 3,
    "employment": [
        {
            "organizationName": "geomedia",
            "endDate": "2010-01-26",
            "startDate": "2010-06-17"
        }
    ]
} ]

The query is equivalent to the following, more complex, inlined form of the query:

SELECT *
FROM GleambookUsers user
WHERE ARRAY_COUNT(user.friendIds) >
    ( SELECT VALUE AVG(ARRAY_COUNT(user.friendIds))
      FROM GleambookUsers AS user
    ) [0];

WITH can be particularly useful when a value needs to be used several times in a query.

Before proceeding further, notice that both the WITH query and its equivalent inlined variant include the syntax “[0]” – this is due to a noteworthy difference between the query language and SQL-92. In SQL-92, whenever a scalar value is expected and it is being produced by a query expression, the SQL-92 query processor will evaluate the expression, check that there is only one row and column in the result at runtime, and then coerce the one-row/one-column tabular result into a scalar value. A JSON query language, being designed to deal with nested data and schema-less data, should not do this. Collection-valued data is perfectly legal in most contexts, and its data is schema-less, so the query processor rarely knows exactly what to expect where and such automatic conversion would often not be desirable. Thus, in the queries above, the use of “[0]” extracts the first (i.e., 0th) element of an array-valued query expression’s result; this is needed above, even though the result is an array of one element, to extract the only element in the singleton array and obtain the desired scalar for the comparison.
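
As a trivial standalone illustration of this subscripting (a sketch mirroring the WITH example above):

( SELECT VALUE AVG(x) FROM [1, 2, 3] AS x )[0];

The subquery should produce the singleton array [ 2.0 ], and the [0] subscript then extracts the scalar 2.0 from it.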

LET Clauses

Similar to WITH clauses, LET clauses can be useful when a (complex) expression is used several times within a query, allowing it to be written once to make the query more concise. The next query shows an example.

Example
SELECT u.name AS uname, messages AS messages
FROM GleambookUsers u
LET messages = (SELECT VALUE m
                FROM GleambookMessages m
                WHERE m.authorId = u.id)
WHERE EXISTS messages;

This query lists GleambookUsers that have posted GleambookMessages and shows all authored messages for each listed user. It returns:

[ {
    "uname": "MargaritaStoddard",
    "messages": [
        {
            "senderLocation": [
                38.97,
                77.49
            ],
            "inResponseTo": 1,
            "messageId": 11,
            "authorId": 1,
            "message": " can't stand acast its plan is terrible"
        },
        {
            "senderLocation": [
                41.66,
                80.87
            ],
            "inResponseTo": 4,
            "messageId": 2,
            "authorId": 1,
            "message": " dislike x-phone its touch-screen is horrible"
        },
        {
            "senderLocation": [
                37.73,
                97.04
            ],
            "inResponseTo": 2,
            "messageId": 4,
            "authorId": 1,
            "message": " can't stand acast the network is horrible:("
        },
        {
            "senderLocation": [
                40.33,
                80.87
            ],
            "inResponseTo": 11,
            "messageId": 8,
            "authorId": 1,
            "message": " like ccast the 3G is awesome:)"
        },
        {
            "senderLocation": [
                42.5,
                70.01
            ],
            "inResponseTo": 12,
            "messageId": 10,
            "authorId": 1,
            "message": " can't stand product-w the touch-screen is terrible"
        }
    ]
}, {
    "uname": "IsbelDull",
    "messages": [
        {
            "senderLocation": [
                31.5,
                75.56
            ],
            "inResponseTo": 1,
            "messageId": 6,
            "authorId": 2,
            "message": " like product-z its platform is mind-blowing"
        },
        {
            "senderLocation": [
                48.09,
                81.01
            ],
            "inResponseTo": 4,
            "messageId": 3,
            "authorId": 2,
            "message": " like product-y the plan is amazing"
        }
    ]
} ]

This query is equivalent to the following query that does not use the LET clause:

SELECT u.name AS uname, ( SELECT VALUE m
                          FROM GleambookMessages m
                          WHERE m.authorId = u.id
                        ) AS messages
FROM GleambookUsers u
WHERE EXISTS ( SELECT VALUE m
               FROM GleambookMessages m
               WHERE m.authorId = u.id
             );

UNION ALL

UNION ALL can be used to combine two input arrays or multisets into one. As in SQL, there is no ordering guarantee on the contents of the output stream. However, unlike SQL, the query language does not constrain what the data looks like on the input streams; in particular, it allows heterogeneity on the input and output streams. A type error will be raised if one of the inputs is not a collection. The following odd but legal query is an example:

Example
SELECT u.name AS uname
FROM GleambookUsers u
WHERE u.id = 2
  UNION ALL
SELECT VALUE m.message
FROM GleambookMessages m
WHERE authorId=2;

This query returns:

[
  " like product-z its platform is mind-blowing"
  , {
    "uname": "IsbelDull"
}, " like product-y the plan is amazing"
 ]

Subqueries

In the query language, an arbitrary subquery can appear anywhere that an expression can appear. Unlike SQL-92, as was just alluded to, the subqueries in a SELECT list or a boolean predicate need not return singleton, single-column relations. Instead, they may return arbitrary collections. For example, the following query is a variant of the prior group-by query examples; it retrieves an array of up to two “dislike” messages per user.

Example
SELECT uid,
       (SELECT VALUE m.msg
        FROM msgs m
        WHERE m.msg.message LIKE '%dislike%'
        ORDER BY m.msg.messageId
        LIMIT 2) AS msgs
FROM GleambookMessages message
GROUP BY message.authorId AS uid GROUP AS msgs(message AS msg);

For our sample data set, this query returns:

[ {
    "msgs": [
        {
            "senderLocation": [
                41.66,
                80.87
            ],
            "inResponseTo": 4,
            "messageId": 2,
            "authorId": 1,
            "message": " dislike x-phone its touch-screen is horrible"
        }
    ],
    "uid": 1
}, {
    "msgs": [

    ],
    "uid": 2
} ]

Note that a subquery, like a top-level SELECT statement, always returns a collection – regardless of where within a query the subquery occurs – and again, its result is never automatically cast into a scalar.

Differences from SQL-92

The query language offers the following additional features beyond SQL-92:

  • Fully composable and functional: A subquery can iterate over any intermediate collection and can appear anywhere in a query.
  • Schema-free: The query language does not assume the existence of a static schema for any data that it processes.
  • Correlated FROM terms: A right-side FROM term expression can refer to variables defined by FROM terms on its left.
  • Powerful GROUP BY: In addition to a set of aggregate functions as in standard SQL, the groups created by the GROUP BY clause are directly usable in nested queries and/or to obtain nested results.
  • Generalized SELECT clause: A SELECT clause can return any type of collection, while in SQL-92, a SELECT clause has to return a (homogeneous) collection of objects.

The following matrix is a quick “SQL-92 compatibility cheat sheet” for the query language.

Feature | The query language | SQL-92 | Why different?
SELECT * | Returns nested objects | Returns flattened concatenated objects | Nested collections are 1st class citizens
SELECT list | order not preserved | order preserved | Fields in a JSON object are not ordered
Subquery | Returns a collection | The returned collection is cast into a scalar value if the subquery appears in a SELECT list or on one side of a comparison or as input to a function | Nested collections are 1st class citizens
LEFT OUTER JOIN | Fills in MISSING(s) for non-matches | Fills in NULL(s) for non-matches | “Absence” is more appropriate than “unknown” here
UNION ALL | Allows heterogeneous inputs and output | Input streams must be UNION-compatible and output field names are drawn from the first input stream | Heterogeneity and nested collections are common
IN constant_expr | The constant expression has to be an array or multiset, i.e., [..,..,…] | The constant collection can be represented as comma-separated items in a paren pair | Nested collections are 1st class citizens
String literal | Double quotes or single quotes | Single quotes only | Double quoted strings are pervasive
Delimited identifiers | Backticks | Double quotes | Double quoted strings are pervasive

The following SQL-92 features are not implemented yet. However, the query language does not conflict with these features:

  • CROSS JOIN, NATURAL JOIN, UNION JOIN
  • RIGHT and FULL OUTER JOIN
  • INTERSECT, EXCEPT, UNION with set semantics
  • CAST expression
  • COALESCE expression
  • ALL and SOME predicates for linking to subqueries
  • UNIQUE predicate (tests a collection for duplicates)
  • MATCH predicate (tests for referential integrity)
  • Row and Table constructors
  • Preserved order for expressions in a SELECT list

4. Errors

A query can potentially result in one of the following errors:

  • syntax error,
  • identifier resolution error,
  • type error,
  • resource error.

If the query processor runs into any error, it will terminate the ongoing processing of the query and immediately return an error message to the client.

Syntax Errors

A valid query must satisfy the grammar rules of the query language. Otherwise, a syntax error will be raised.

Example
SELECT *
GleambookUsers user

Since the query is missing the FROM keyword before the dataset name GleambookUsers, we will get a syntax error as follows:

Syntax error: In line 2 >>GleambookUsers user;<< Encountered <IDENTIFIER> "GleambookUsers" at column 1.
Example
SELECT *
FROM GleambookUsers user
WHERE type="advertiser";

Since “type” is a reserved keyword in the query parser, we will get a syntax error as follows:

Error: Syntax error: In line 3 >>WHERE type="advertiser";<< Encountered 'type' "type" at column 7.
==> WHERE type="advertiser";

Identifier Resolution Errors

Referring to an undefined identifier can cause an error if the identifier cannot be successfully resolved as a valid field access.

Example
SELECT *
FROM GleambookUser user;

If, as above, we have a typo in the dataset name that omits the final “s” of “GleambookUsers”, we will get an identifier resolution error as follows:

Error: Cannot find dataset GleambookUser in dataverse Default nor an alias with name GleambookUser!
Example
SELECT name, message
FROM GleambookUsers u JOIN GleambookMessages m ON m.authorId = u.id;

If the compiler cannot figure out how to resolve an unqualified field name, which will occur if there is more than one variable in scope (e.g., GleambookUsers u and GleambookMessages m as above), we will get an identifier resolution error as follows:

Error: Cannot resolve ambiguous alias reference for undefined identifier name

Type Errors

The query compiler does type checks based on its available type information. In addition, the query runtime also reports type errors if a data model instance it processes does not satisfy the type requirement.

Example
abs("123");

Since function abs can only process numeric input values, we will get a type error as follows:

Error: Type mismatch: function abs expects its 1st input parameter to be of type tinyint, smallint, integer, bigint, float or double, but the actual input type is string

Resource Errors

A query can potentially exhaust system resources, such as the number of open files and the available disk space. For instance, the following two resource errors could potentially be seen when running the system:

Error: no space left on device
Error: too many open files

The “no space left on device” issue can usually be fixed by cleaning up disk space and reserving more disk space for the system. The “too many open files” issue can usually be fixed by a system administrator, following the instructions here.

5. DDL and DML statements

Statement ::= ( ( SingleStatement )? ( ";" )+ )* <EOF>
SingleStatement ::= DatabaseDeclaration
                  | FunctionDeclaration
                  | CreateStatement
                  | DropStatement
                  | LoadStatement
                  | SetStatement
                  | InsertStatement
                  | DeleteStatement
                  | Query

In addition to queries, an implementation of the query language needs to support statements for data definition and manipulation purposes as well as controlling the context to be used in evaluating query expressions. This section details the DDL and DML statements supported in the query language as realized today in Apache AsterixDB.

Lifecycle Management Statements

CreateStatement ::= "CREATE" ( DatabaseSpecification
                             | TypeSpecification
                             | DatasetSpecification
                             | IndexSpecification
                             | FunctionSpecification )

QualifiedName       ::= Identifier ( "." Identifier )?
DoubleQualifiedName ::= Identifier "." Identifier ( "." Identifier )?

The CREATE statement is used for creating dataverses as well as other persistent artifacts in a dataverse. It can be used to create new dataverses, datatypes, datasets, indexes, and user-defined query functions.

Dataverses

DatabaseSpecification ::= "DATAVERSE" Identifier IfNotExists

The CREATE DATAVERSE statement is used to create new dataverses. To ease the authoring of reusable query scripts, an optional IF NOT EXISTS clause is included to allow creation to be requested either unconditionally or only if the dataverse does not already exist. If this clause is absent, an error is returned if a dataverse with the indicated name already exists.

The following example creates a new dataverse named TinySocial if one does not already exist.

Example
CREATE DATAVERSE TinySocial IF NOT EXISTS;

Types

TypeSpecification    ::= "TYPE" FunctionOrTypeName IfNotExists "AS" ObjectTypeDef
FunctionOrTypeName   ::= QualifiedName
IfNotExists          ::= ( <IF> <NOT> <EXISTS> )?
TypeExpr             ::= ObjectTypeDef | TypeReference | ArrayTypeDef | MultisetTypeDef
ObjectTypeDef        ::= ( <CLOSED> | <OPEN> )? "{" ( ObjectField ( "," ObjectField )* )? "}"
ObjectField          ::= Identifier ":" ( TypeExpr ) ( "?" )?
NestedField          ::= Identifier ( "." Identifier )*
IndexField           ::= NestedField ( ":" TypeReference )?
TypeReference        ::= Identifier
ArrayTypeDef         ::= "[" ( TypeExpr ) "]"
MultisetTypeDef      ::= "{{" ( TypeExpr ) "}}"

The CREATE TYPE statement is used to create a new named datatype. This type can then be used to create stored collections or utilized when defining one or more other datatypes. Much more information about the data model is available in the data model reference guide. A new type can be an object type, a renaming of another type, an array type, or a multiset type. An object type can be defined as being either open or closed. Instances of a closed object type are not permitted to contain fields other than those specified in the create type statement. Instances of an open object type may carry additional fields, and open is the default for new types if neither option is specified.

The following example creates a new object type called GleambookUserType. Since it is defined as (defaulting to) being an open type, instances will be permitted to contain more than what is specified in the type definition. The first four fields are essentially traditional typed name/value pairs (much like SQL fields). The friendIds field is a multiset of integers. The employment field is an array of instances of another named object type, EmploymentType.

Example
CREATE TYPE GleambookUserType AS {
  id:         int,
  alias:      string,
  name:       string,
  userSince: datetime,
  friendIds: {{ int }},
  employment: [ EmploymentType ]
};
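
The EmploymentType referenced above is assumed to have been created beforehand; it is not defined in this document. Based on the employment objects in the sample data, one plausible definition would be the following sketch (the exact field names and optionality are assumptions):

CREATE TYPE EmploymentType AS OPEN {
  organizationName: string,
  startDate:        date?,
  endDate:          date?
};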

The next example creates a new object type, closed this time, called MyUserTupleType. Instances of this closed type will not be permitted to have extra fields, although the alias field is marked as optional and may thus be NULL or MISSING in legal instances of the type. Note that the type of the id field in the example is UUID. This field type can be used if you want to have this field be an autogenerated-PK field. (Refer to the Datasets section later for more details on such fields.)

Example
CREATE TYPE MyUserTupleType AS CLOSED {
  id:         uuid,
  alias:      string?,
  name:       string
};

Datasets

DatasetSpecification ::= ( <INTERNAL> )? <DATASET> QualifiedName "(" QualifiedName ")" IfNotExists
                           PrimaryKey ( <ON> Identifier )? ( <HINTS> Properties )?
                           ( "USING" "COMPACTION" "POLICY" CompactionPolicy ( Configuration )? )?
                           ( <WITH> <FILTER> <ON> Identifier )?
                          |
                           <EXTERNAL> <DATASET> QualifiedName "(" QualifiedName ")" IfNotExists <USING> AdapterName
                           Configuration ( <HINTS> Properties )?
                           ( <USING> <COMPACTION> <POLICY> CompactionPolicy ( Configuration )? )?
AdapterName          ::= Identifier
Configuration        ::= "(" ( KeyValuePair ( "," KeyValuePair )* )? ")"
KeyValuePair         ::= "(" StringLiteral "=" StringLiteral ")"
Properties           ::= ( "(" Property ( "," Property )* ")" )?
Property             ::= Identifier "=" ( StringLiteral | IntegerLiteral )
FunctionSignature    ::= FunctionOrTypeName "@" IntegerLiteral
PrimaryKey           ::= <PRIMARY> <KEY> NestedField ( "," NestedField )* ( <AUTOGENERATED> )?
CompactionPolicy     ::= Identifier

The CREATE DATASET statement is used to create a new dataset. Datasets are named multisets of object type instances; they are where data lives persistently and are the usual targets for queries. Datasets are typed, and the system ensures that their contents conform to their type definitions. An Internal dataset (the default kind) is a dataset whose content lives within and is managed by the system. It is required to have a specified unique primary key field which uniquely identifies the contained objects. (The primary key is also used in secondary indexes to identify the indexed primary data objects.)

Internal datasets contain several advanced options that can be specified when appropriate. One such option is that random primary key (UUID) values can be auto-generated by declaring the field to be UUID and putting “AUTOGENERATED” after the “PRIMARY KEY” identifier. In this case, unlike other non-optional fields, a value for the auto-generated PK field should not be provided at insertion time by the user since each object’s primary key field value will be auto-generated by the system.

Another advanced option, when creating an Internal dataset, is to specify the merge policy that controls which of the underlying LSM storage components get merged. (The system supports Log-Structured Merge tree based physical storage for Internal datasets.) Currently the system supports four different component merging policies that can be chosen per dataset: no-merge, constant, prefix, and correlated-prefix. The no-merge policy simply never merges disk components. The constant policy merges disk components when the number of components reaches a constant number k that can be configured by the user. The prefix policy relies on both component sizes and the number of components to decide which components to merge. It works by first trying to identify the smallest ordered (oldest to newest) sequence of components such that the sequence does not contain a single component that exceeds some threshold size M and such that either the sum of the components’ sizes exceeds M or the number of components in the sequence exceeds another threshold C. If such a sequence exists, the components in the sequence are merged together to form a single component. Finally, the correlated-prefix policy is similar to the prefix policy, but it delegates the decision of merging the disk components of all the indexes in a dataset to the primary index. When the correlated-prefix policy decides that the primary index needs to be merged (using the same decision criteria as for the prefix policy), it will issue successive merge requests on behalf of all other indexes associated with the same dataset. The system’s default policy is the prefix policy, except when a dataset has a filter, in which case the preferred policy is correlated-prefix.

Another advanced option shown in the syntax above, related to performance and mentioned above, is that a filter can optionally be created on a field to further optimize range queries with predicates on the filter’s field. Filters allow some range queries to avoid searching all LSM components when the query conditions match the filter. (Refer to Filter-Based LSM Index Acceleration for more information about filters.)
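
Purely for illustration, the following sketch shows how these options appear syntactically when creating a dataset. The type name GleambookMessageType, the prefix policy parameter names, and the sendTime filter field are assumptions made for the sake of the example and may need to be adapted:

CREATE DATASET GleambookMessages(GleambookMessageType)
  PRIMARY KEY messageId
  USING COMPACTION POLICY prefix
  (("max-mergable-component-size"="134217728"),
   ("max-tolerance-component-count"="5"))
  WITH FILTER ON sendTime;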

An External dataset, in contrast to an Internal dataset, has data stored outside of the system’s control. Files living in HDFS or in the local filesystem(s) of a cluster’s nodes are currently supported. External dataset support allows queries to treat foreign data as though it were stored in the system, making it possible to query “legacy” file data (for example, Hive data) without having to physically import it. When defining an External dataset, an appropriate adapter type must be selected for the desired external data. (See the Guide to External Data for more information on the available adapters.)

The following example creates an Internal dataset for storing GleambookUserType objects. It specifies that their id field is their primary key.

Example

CREATE INTERNAL DATASET GleambookUsers(GleambookUserType) PRIMARY KEY id;

The next example creates another Internal dataset (the default kind when no dataset kind is specified) for storing MyUserTupleType objects. It specifies that the id field should be used as the primary key for the dataset. It also specifies that the id field is an auto-generated field, meaning that a randomly generated UUID value should be assigned to each incoming object by the system. (A user should therefore not attempt to provide a value for this field.) Note that the id field’s declared type must be UUID in this case.

Example

CREATE DATASET MyUsers(MyUserTupleType) PRIMARY KEY id AUTOGENERATED;

The next example creates an External dataset for querying LineItemType objects. The choice of the hdfs adapter means that this dataset’s data actually resides in HDFS. The example CREATE statement also provides parameters used by the hdfs adapter: the URL and path needed to locate the data in HDFS and a description of the data format.

Example

CREATE EXTERNAL DATASET LineItem(LineItemType) USING hdfs (
  ("hdfs"="hdfs://HOST:PORT"),
  ("path"="HDFS_PATH"),
  ("input-format"="text-input-format"),
  ("format"="delimited-text"),
  ("delimiter"="|"));

Indices

IndexSpecification ::= <INDEX> Identifier IfNotExists <ON> QualifiedName
                       "(" ( IndexField ) ( "," IndexField )* ")" ( "type" IndexType "?")?
                       ( (<NOT>)? <ENFORCED> )?
IndexType          ::= <BTREE> | <RTREE> | <KEYWORD> | <NGRAM> "(" IntegerLiteral ")"

The CREATE INDEX statement creates a secondary index on one or more fields of a specified dataset. Supported index types include BTREE for totally ordered datatypes, RTREE for spatial data, and KEYWORD and NGRAM for textual (string) data. An index can be created on a nested field (or fields) by providing a valid path expression as an index field identifier.

An indexed field is not required to be part of the datatype associated with a dataset if the dataset’s datatype is declared as open and if the field’s type is provided along with its name and if the ENFORCED keyword is specified at the end of the index definition. ENFORCING an open field introduces a check that makes sure that the actual type of the indexed field (if the optional field exists in the object) always matches this specified (open) field type.

The following example creates a btree index called gbAuthorIdx on the authorId field of the GleambookMessages dataset. This index can be useful for accelerating exact-match queries, range search queries, and joins involving the authorId field.

Example

CREATE INDEX gbAuthorIdx ON GleambookMessages(authorId) TYPE BTREE;

The following example creates an open btree index called gbSendTimeIdx on the (non-predeclared) sendTime field of the GleambookMessages dataset having datetime type. This index can be useful for accelerating exact-match queries, range search queries, and joins involving the sendTime field. The index is enforced so that records that do not have the “sendTime” field or have a mismatched type on the field cannot be inserted into the dataset.

Example

CREATE INDEX gbSendTimeIdx ON GleambookMessages(sendTime: datetime?) TYPE BTREE ENFORCED;

The following example creates a btree index called crpUserScrNameIdx on screenName, a nested field residing within a object-valued user field in the ChirpMessages dataset. This index can be useful for accelerating exact-match queries, range search queries, and joins involving the nested screenName field. Such nested fields must be singular, i.e., one cannot index through (or on) an array-valued field.

Example

CREATE INDEX crpUserScrNameIdx ON ChirpMessages(user.screenName) TYPE BTREE;

The following example creates an rtree index called gbSenderLocIndex on the sender-location field of the GleambookMessages dataset. This index can be useful for accelerating queries that use the spatial-intersect function in a predicate involving the sender-location field.

Example

CREATE INDEX gbSenderLocIndex ON GleambookMessages("sender-location") TYPE RTREE;

The following example creates a 3-gram index called fbUserIdx on the name field of the GleambookUsers dataset. This index can be used to accelerate some similarity or substring matching queries on the name field. For details refer to the document on similarity queries.

Example

CREATE INDEX fbUserIdx ON GleambookUsers(name) TYPE NGRAM(3);

The following example creates a keyword index called fbMessageIdx on the message field of the GleambookMessages dataset. This keyword index can be used to optimize queries with token-based similarity predicates on the message field. For details refer to the document on similarity queries.

Example

CREATE INDEX fbMessageIdx ON GleambookMessages(message) TYPE KEYWORD;

The following example creates an open btree index called gbReadTimeIdx on the (non-predeclared) readTime field of the GleambookMessages dataset having datetime type. This index can be useful for accelerating exact-match queries, range search queries, and joins involving the readTime field. The index is not enforced so that records that do not have the readTime field or have a mismatched type on the field can still be inserted into the dataset.

Example

CREATE INDEX gbReadTimeIdx ON GleambookMessages(readTime: datetime?);

Functions

The create function statement creates a named function that can then be used and reused in queries. The body of a function can be any query expression involving the function’s parameters.

FunctionSpecification ::= "FUNCTION" FunctionOrTypeName IfNotExists ParameterList "{" Expression "}"

The following is an example of a CREATE FUNCTION statement which is similar to our earlier DECLARE FUNCTION example. It differs from that example in that it results in a function that is persistently registered by name in the specified dataverse (the current dataverse being used, if not otherwise specified).

Example
CREATE FUNCTION friendInfo(userId) {
    (SELECT u.id, u.name, len(u.friendIds) AS friendCount
     FROM GleambookUsers u
     WHERE u.id = userId)[0]
 };
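
Once created, the function can be invoked in queries just like a built-in function. For example, the following sketch returns the id, name, and friend count of the user whose id is 2:

friendInfo(2);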

Removal

DropStatement       ::= "DROP" ( "DATAVERSE" Identifier IfExists
                               | "TYPE" FunctionOrTypeName IfExists
                               | "DATASET" QualifiedName IfExists
                               | "INDEX" DoubleQualifiedName IfExists
                               | "FUNCTION" FunctionSignature IfExists )
IfExists            ::= ( "IF" "EXISTS" )?

The DROP statement is the inverse of the CREATE statement. It can be used to drop dataverses, datatypes, datasets, indexes, and functions.

The following examples illustrate some uses of the DROP statement.

Example
DROP DATASET GleambookUsers IF EXISTS;

DROP INDEX GleambookMessages.gbSenderLocIndex;

DROP TYPE TinySocial2.GleambookUserType;

DROP FUNCTION friendInfo@1;

DROP DATAVERSE TinySocial;

When an artifact is dropped, it will be dropped from the current dataverse if none is specified (see the DROP DATASET example above) or from the specified dataverse (see the DROP TYPE example above) if one is specified by fully qualifying the artifact name in the DROP statement. When specifying an index to drop, the index name must be qualified by the dataset that it indexes. When specifying a function to drop, since the query language allows functions to be overloaded by their number of arguments, the identifying name of the function to be dropped must explicitly include that information. (friendInfo@1 above denotes the 1-argument function named friendInfo in the current dataverse.)

Load Statement

LoadStatement  ::= <LOAD> <DATASET> QualifiedName <USING> AdapterName Configuration ( <PRE-SORTED> )?

The LOAD statement is used to initially populate a dataset via bulk loading of data from an external file. An appropriate adapter must be selected to handle the nature of the desired external data. The LOAD statement accepts the same adapters and the same parameters as discussed earlier for External datasets. (See the guide to external data for more information on the available adapters.) If a dataset has an auto-generated primary key field, the file to be imported should not include that field in it.

The following example shows how to bulk load the GleambookUsers dataset from an external file containing data that has been prepared in ADM (Asterix Data Model) format.

Example
 LOAD DATASET GleambookUsers USING localfs
    (("path"="127.0.0.1:///Users/bignosqlfan/tinysocialnew/gbu.adm"),("format"="adm"));

Modification statements

INSERTs

InsertStatement ::= <INSERT> <INTO> QualifiedName Query

The INSERT statement is used to insert new data into a dataset. The data to be inserted comes from a query expression. This expression can be as simple as a constant expression, or in general it can be any legal query. If the target dataset has an auto-generated primary key field, the insert statement should not include a value for that field in it. (The system will automatically extend the provided object with this additional field and a corresponding value.) Insertion will fail if the dataset already has data with the primary key value(s) being inserted.

Inserts are processed transactionally by the system. The transactional scope of each insert transaction is the insertion of a single object plus its affiliated secondary index entries (if any). If the query part of an insert returns a single object, then the INSERT statement will be a single, atomic transaction. If the query part returns multiple objects, each object being inserted will be treated as a separate transaction. The following example illustrates a query-based insertion.

Example
INSERT INTO UsersCopy (SELECT VALUE user FROM GleambookUsers user)
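
The query part of an INSERT can also be just a constant expression. The following sketch inserts a single, made-up object directly; the field values are invented for illustration and merely need to conform to the dataset’s declared type:

INSERT INTO GleambookUsers
([{
   "id": 667,
   "alias": "dfuller",
   "name": "DanFuller",
   "userSince": datetime("2017-08-15T08:10:00"),
   "friendIds": {{ 1, 3 }},
   "employment": [{"organizationName": "Print9", "startDate": date("2012-02-12")}]
}]);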

UPSERTs

UpsertStatement ::= <UPSERT> <INTO> QualifiedName Query

The UPSERT statement syntactically mirrors the INSERT statement discussed above. The difference lies in its semantics, which for UPSERT are “add or replace” instead of the INSERT “add if not present, else error” semantics. Whereas an INSERT can fail if another object already exists with the specified key, the analogous UPSERT will replace the previous object’s value with that of the new object in such cases.

The following example illustrates a query-based upsert operation.

Example
UPSERT INTO UsersCopy (SELECT VALUE user FROM GleambookUsers user)

Editor’s note: Upserts currently work in AQL but are not yet enabled in the query language.

DELETEs

DeleteStatement ::= <DELETE> <FROM> QualifiedName ( ( <AS> )? Variable )? ( <WHERE> Expression )?

The DELETE statement is used to delete data from a target dataset. The data to be deleted is identified by a boolean expression involving the variable bound to the target dataset in the DELETE statement.

Deletes are processed transactionally by the system. The transactional scope of each delete transaction is the deletion of a single object plus its affiliated secondary index entries (if any). If the boolean expression for a delete identifies a single object, then the DELETE statement itself will be a single, atomic transaction. If the expression identifies multiple objects, then each object deleted will be handled as a separate transaction.

The following examples illustrate single-object deletions.

Example
DELETE FROM GleambookUsers user WHERE user.id = 8;
Example
DELETE FROM GleambookUsers WHERE id = 5;

Appendix 1. Reserved keywords

All reserved keywords are listed in the following table:

AND ANY APPLY AS ASC AT
AUTOGENERATED BETWEEN BTREE BY CASE CLOSED
CREATE COMPACTION COMPACT CONNECT CORRELATE DATASET
COLLECTION DATAVERSE DECLARE DEFINITION DELETE DESC
DISCONNECT DISTINCT DROP ELEMENT EXPLAIN ELSE
ENFORCED END EVERY
EXCEPT EXIST EXTERNAL FEED FILTER FLATTEN
FOR FROM FULL FUNCTION GROUP HAVING
HINTS IF INTO IN INDEX INGESTION
INNER INSERT INTERNAL INTERSECT IS JOIN
KEYWORD LEFT LETTING LET LIKE LIMIT
LOAD NODEGROUP NGRAM NOT OFFSET ON
OPEN OR ORDER OUTER OUTPUT PATH
POLICY PRE-SORTED PRIMARY RAW REFRESH RETURN
RTREE RUN SATISFIES SECONDARY SELECT SET
SOME TEMPORARY THEN TYPE UNKNOWN UNNEST
UPDATE USE USING VALUE WHEN WHERE
WITH WRITE

Appendix 2. Performance Tuning

The SET statement can be used to override some cluster-wide configuration parameters for a specific request:

SET <IDENTIFIER> <STRING_LITERAL>

As parameter identifiers are qualified names (containing a ‘.’) they have to be escaped using backticks (``). Note that changing query parameters will not affect query correctness but only impact performance characteristics, such as response time and throughput.

Parallelism Parameter

The system can execute each request using multiple cores on multiple machines (a.k.a., partitioned parallelism) in a cluster. A user can manually specify the maximum execution parallelism for a request to scale it up and down using the following parameter:

  • compiler.parallelism: the maximum number of CPU cores that can be used to process a query. There are three cases for the value p of compiler.parallelism:
    • p < 0 or p > the total number of cores in a cluster: the system will use all available cores in the cluster;

    • p = 0 (the default): the system will use the storage parallelism (the number of partitions of stored datasets) as the maximum parallelism for query processing;

    • all other cases: the system will use the user-specified number as the maximum number of CPU cores to use for executing the query.

Example
SET `compiler.parallelism` "16";

SELECT u.name AS uname, m.message AS message
FROM GleambookUsers u JOIN GleambookMessages m ON m.authorId = u.id;

Memory Parameters

In the system, each blocking runtime operator such as join, group-by, and order-by works within a fixed memory budget and can gracefully spill to disk if the memory budget is smaller than the amount of data it has to hold. A user can manually configure the memory budget of those operators within a query. The supported configurable memory parameters are:

  • compiler.groupmemory: the memory budget that each parallel group-by operator instance can use; 32MB is the default budget.

  • compiler.sortmemory: the memory budget that each parallel sort operator instance can use; 32MB is the default budget.

  • compiler.joinmemory: the memory budget that each parallel hash join operator instance can use; 32MB is the default budget.

For each memory budget value, you can use a 64-bit integer value with a 1024-based binary unit suffix (for example, B, KB, MB, GB). If there is no user-provided suffix, “B” is the default suffix. See the following examples.

Example
SET `compiler.groupmemory` "64MB";

SELECT msg.authorId, COUNT(*)
FROM GleambookMessages msg
GROUP BY msg.authorId;
Example
SET `compiler.sortmemory` "67108864";

SELECT VALUE user
FROM GleambookUsers AS user
ORDER BY ARRAY_COUNT(user.friendIds) DESC;
Example
SET `compiler.joinmemory` "132000KB";

SELECT u.name AS uname, m.message AS message
FROM GleambookUsers u JOIN GleambookMessages m ON m.authorId = u.id;

Controlling Index-Only-Plan Parameter

By default, the system tries to build an index-only plan whenever utilizing a secondary index is possible. For example, if a SELECT or JOIN query can utilize an enforced B+Tree or R-Tree index on a field, the optimizer checks whether a secondary-index search alone can generate the result that the query asks for. It mainly checks two conditions: (1) the predicates used in WHERE refer only to the primary key field and/or the secondary key field, and (2) the result does not return any other fields. If these two conditions hold, it builds an index-only plan. Since an index-only plan searches only a secondary index to answer a query, it is faster than a non-index-only plan that also needs to search the primary index. However, index-only plans can be turned off per query by setting the following parameter.

  • noindexonly: if this is set to true, the index-only-plan will not be applied; the default value is false.
Example
SET noindexonly 'true';

SELECT m.message AS message
FROM GleambookMessages m where m.message = " love product-b its shortcut-menu is awesome:)";

Appendix 3. Variable Bindings and Name Resolution

In this Appendix, we’ll look at how variables are bound and how names are resolved. Names can appear in every clause of a query. Sometimes a name consists of just a single identifier, e.g., region or revenue. More often a name will consist of two identifiers separated by a dot, e.g., customer.address. Occasionally a name may have more than two identifiers, e.g., policy.owner.address.zipcode. Resolving a name means determining exactly what the (possibly multi-part) name refers to. It is necessary to have well-defined rules for how to resolve a name in cases of ambiguity. (In the absence of schemas, such cases arise more commonly, and also differently, than they do in SQL.)

The basic job of each clause in a query block is to bind variables. Each clause sees the variables bound by previous clauses and may bind additional variables. Names are always resolved with respect to the variables that are bound (“in scope”) at the place where the name use in question occurs. It is possible that the name resolution process will fail, which may lead to an empty result or an error message.

One important bit of background: When the system is reading a query and resolving its names, it has a list of all the available dataverses and datasets. As a result, it knows whether a.b is a valid name for dataset b in dataverse a. However, the system does not in general have knowledge of the schemas of the data inside the datasets; remember that this is a much more open world. As a result, in general the system cannot know whether any object in a particular dataset will have a field named c. These assumptions affect how errors are handled. If you try to access dataset a.b and no dataset by that name exists, you will get an error and your query will not run. However, if you try to access a field c in a collection of objects, your query will run and return missing for each object that doesn’t have a field named c – this is because it’s possible that some object (someday) could have such a field.
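
As a concrete illustration of this difference, consider the following two sketches against the sample data. The first runs and simply omits the nickname value for users that lack that field; the second fails outright because the deliberately misspelled dataset name GleambookUserz cannot be resolved:

SELECT user.name, user.nickname
FROM GleambookUsers AS user;

SELECT user.name
FROM GleambookUserz AS user;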

Binding Variables

Variables can be bound in the following ways:

  1. WITH and LET clauses bind a variable to the result of an expression in a straightforward way

    Examples:

    WITH cheap_parts AS (SELECT partno FROM parts WHERE price < 100) binds the variable cheap_parts to the result of the subquery.

    LET pay = salary + bonus binds the variable pay to the result of evaluating the expression salary + bonus.

  2. FROM, GROUP BY, and SELECT clauses have optional AS subclauses that contain an expression and a name (called an iteration variable in a FROM clause, or an alias in GROUP BY or SELECT.)

    Examples:

    FROM customer AS c, order AS o

    GROUP BY salary + bonus AS total_pay

    SELECT MAX(price) AS highest_price

    An AS subclause always binds the name (as a variable) to the result of the expression (or, in the case of a FROM clause, to the individual members of the collection identified by the expression.)

    It’s always a good practice to use the keyword AS when defining an alias or iteration variable. However, as in SQL, the syntax allows the keyword AS to be omitted. For example, the FROM clause above could have been written like this:

    FROM customer c, order o

    Omitting the keyword AS does not affect the binding of variables. The FROM clause in this example binds variables c and o whether the keyword AS is used or not.

    In certain cases, a variable is automatically bound even if no alias or variable-name is specified. Whenever an expression could have been followed by an AS subclause, if the expression consists of a simple name or a path expression, that expression binds a variable whose name is the same as the simple name or the last step in the path expression. Here are some examples:

    FROM customer, order binds iteration variables named customer and order

    GROUP BY address.zipcode binds a variable named zipcode

    SELECT item[0].price binds a variable named price

    Note that a FROM clause iterates over a collection (usually a dataset), binding a variable to each member of the collection in turn. The name of the collection remains in scope, but it is not a variable. For example, consider this FROM clause used in a self-join:

    FROM customer AS c1, customer AS c2

    This FROM clause joins the customer dataset to itself, binding the iteration variables c1 and c2 to objects in the left-hand-side and right-hand-side of the join, respectively. After the FROM clause, c1 and c2 are in scope as variables, and customer remains accessible as a dataset name but not as a variable.

  3. Special rules for GROUP BY:

    1. If a GROUP BY clause specifies an expression that has no explicit alias, it binds a pseudo-variable that is lexicographically identical to the expression itself. For example:

      GROUP BY salary + bonus binds a pseudo-variable named salary + bonus.

      This rule allows subsequent clauses to refer to the grouping expression (salary + bonus) even though its constituent variables (salary and bonus) are no longer in scope. For example, the following query is valid:

      FROM employee
      GROUP BY salary + bonus
      HAVING salary + bonus > 1000
      SELECT salary + bonus, COUNT(*) AS how_many
      

      While it might have been more elegant to explicitly require an alias in cases like this, the pseudo-variable rule is retained for SQL compatibility. Note that the expression salary + bonus is not actually evaluated in the HAVING and SELECT clauses (and could not be since salary and bonus are no longer individually in scope). Instead, the expression salary + bonus is treated as a reference to the pseudo-variable defined in the GROUP BY clause.

    2. A GROUP BY clause may be followed by a GROUP AS clause that binds a variable to the group. The purpose of this variable is to make the individual objects inside the group visible to subqueries that may need to iterate over them.

      The GROUP AS variable is bound to a multiset of objects. Each object represents one of the members of the group. Since the group may have been formed from a join, each of the member-objects contains a nested object for each variable bound by the nearest FROM clause (and its LET subclause, if any). These nested objects, in turn, contain the actual fields of the group-member. To understand this process, consider the following query fragment:

      FROM parts AS p, suppliers AS s
      WHERE p.suppno = s.suppno
      GROUP BY p.color GROUP AS g
      

      Suppose that the objects in parts have fields partno, color, and suppno. Suppose that the objects in suppliers have fields suppno and location.

      Then, for each group formed by the GROUP BY, the variable g will be bound to a multiset with the following structure:

      [ { "p": { "partno": "p1", "color": "red", "suppno": "s1" },
          "s": { "suppno": "s1", "location": "Denver" } },
        { "p": { "partno": "p2", "color": "red", "suppno": "s2" },
          "s": { "suppno": "s2", "location": "Atlanta" } },
        ...
      ]
      

Scoping

In general, the variables that are in scope at a particular position are those variables that were bound earlier in the current query block, in outer (enclosing) query blocks, or in a WITH clause at the beginning of the query. More specific rules follow.

The clauses in a query block are conceptually processed in the following order:

  • FROM (followed by LET subclause, if any)
  • WHERE
  • GROUP BY (followed by LET subclause, if any)
  • HAVING
  • SELECT or SELECT VALUE
  • ORDER BY
  • OFFSET
  • LIMIT

During processing of each clause, the variables that are in scope are those variables that are bound in the following places:

  1. In earlier clauses of the same query block (as defined by the ordering given above).

    Example: in FROM orders AS o SELECT o.date, the variable o in the SELECT clause is bound, in turn, to each object in the dataset orders.

  2. In outer query blocks in which the current query block is nested. In case of duplication, the innermost binding wins.

  3. In the WITH clause (if any) at the beginning of the query.

However, in a query block where a GROUP BY clause is present:

  1. In clauses processed before GROUP BY, scoping rules are the same as though no GROUP BY were present.

  2. In clauses processed after GROUP BY, the variables bound in the nearest FROM-clause (and its LET subclause, if any) are removed from scope and replaced by the variables bound in the GROUP BY clause (and its LET subclause, if any). However, this replacement does not apply inside the arguments of the five SQL special aggregating functions (MIN, MAX, AVG, SUM, and COUNT). These functions still need to see the individual data items over which they are computing an aggregation. For example, after FROM employee AS e GROUP BY deptno, it would not be valid to reference e.salary, but AVG(e.salary) would be valid.
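
Spelling out the example in the last point above as a complete query block (a sketch assuming a hypothetical employee dataset with deptno and salary fields):

FROM employee AS e
GROUP BY e.deptno AS deptno
SELECT deptno, AVG(e.salary) AS avg_salary

Here e.salary may appear only inside the aggregate function; referencing e.salary directly in the SELECT clause would be invalid after the GROUP BY.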

Special case: In an expression inside a FROM clause, a variable is in scope if it was bound in an earlier expression in the same FROM clause. Example:

FROM orders AS o, o.items AS i

The reason for this special case is to support iteration over nested collections.

Note that, since the SELECT clause comes after the WHERE and GROUP BY clauses in conceptual processing order, any variables defined in SELECT are not visible in WHERE or GROUP BY. Therefore the following query will not return what might be the expected result (since in the WHERE clause, pay will be interpreted as a field in the emp object rather than as the computed value salary + bonus):

SELECT name, salary + bonus AS pay
FROM emp
WHERE pay > 1000
ORDER BY pay

The likely intent of the query above can be accomplished as follows:

FROM emp AS e
LET pay = e.salary + e.bonus
WHERE pay > 1000
SELECT e.name, pay
ORDER BY pay

Note that variables defined by JOIN subclauses are not visible to other subclauses in the same FROM clause. This also applies to the FROM variable that starts the JOIN subclause.

Resolving Names

The process of name resolution begins with the leftmost identifier in the name. The rules for resolving the leftmost identifier are:

  1. In a FROM clause: Names in a FROM clause identify the collections over which the query block will iterate. These collections may be stored datasets or may be the results of nested query blocks. A stored dataset may be in a named dataverse or in the default dataverse. Thus, if the two-part name a.b is in a FROM clause, a might represent a dataverse and b might represent a dataset in that dataverse. Another example of a two-part name in a FROM clause is FROM orders AS o, o.items AS i. In o.items, o represents an order object bound earlier in the FROM clause, and items represents the items object inside that order.

    The rules for resolving the leftmost identifier in a FROM clause (including a JOIN subclause), or in the expression following IN in a quantified predicate, are as follows:

    1. If the identifier matches a variable-name that is in scope, it resolves to the binding of that variable. (Note that in the case of a subquery, an in-scope variable might have been bound in an outer query block; this is called a correlated subquery.)

    2. Otherwise, if the identifier is the first part of a two-part name like a.b, the name is treated as dataverse.dataset. If the identifier stands alone as a one-part name, it is treated as the name of a dataset in the default dataverse. An error will result if the designated dataverse or dataset does not exist.

  2. Elsewhere in a query block: In clauses other than FROM, a name typically identifies a field of some object. For example, if the expression a.b is in a SELECT or WHERE clause, it’s likely that a represents an object and b represents a field in that object.

    The rules for resolving the leftmost identifier in clauses other than the ones listed in Rule 1 are:

    1. If the identifier matches a variable-name that is in scope, it resolves to the binding of that variable. (In the case of a correlated subquery, the in-scope variable might have been bound in an outer query block.)

    2. (The “Single Variable Rule”): Otherwise, if the FROM clause (or a LET clause if there is no FROM clause) in the current query block binds exactly one variable, the identifier is treated as a field access on the object bound to that variable. For example, in the query FROM customer SELECT address, the identifier address is treated as a field in the object bound to the variable customer. At runtime, if the object bound to customer has no address field, the address expression will return missing. If the FROM clause (and its LET subclause, if any) in the current query block binds multiple variables, name resolution fails with an “ambiguous name” error. Note that the Single Variable Rule searches for bound variables only in the current query block, not in outer (containing) blocks. The purpose of this rule is to permit the compiler to resolve field-references unambiguously without relying on any schema information.

      Exception: In a query that has a GROUP BY clause, the Single Variable Rule does not apply in any clauses that occur after the GROUP BY because, in these clauses, the variables bound by the FROM clause are no longer in scope. In clauses after GROUP BY, only Rule 2.1 applies.

  3. In an ORDER BY clause following a UNION ALL expression:

    The leftmost identifier is treated as a field-access on the objects that are generated by the UNION ALL. For example:

    query-block-1
    UNION ALL
    query-block-2
    ORDER BY salary
    

    In the result of this query, objects that have a salary field will be ordered by the value of this field; objects that have no salary field will appear at the beginning of the query result (in ascending order) or at the end (in descending order.)

  4. Once the leftmost identifier has been resolved, the following dots and identifiers in the name (if any) are treated as a path expression that navigates to a field nested inside that object. The name resolves to the field at the end of the path. If this field does not exist, the value missing is returned.