A SELECT clause that specifies to return only distinct
values.
Description
The optional DISTINCT clause appears after
the SELECT keyword and before the optional TOP clause and the first select-item.
The DISTINCT clause is applied to the result
set of the SELECT statement. It limits
the rows returned to one arbitrary row for each distinct (unique)
value. If no DISTINCT clause is specified, the default is to display
all the rows that fulfill the SELECT criteria.
The ALL clause is the same as specifying no DEFAULT clause; if you
specify ALL, SELECT returns all the rows in the
table that fulfill the SELECT criteria.
The DISTINCT clause has two forms:
-
SELECT DISTINCT: Returns one row for each unique combination
of select-item values. You can specify one or more
than one select-items. For example, the following
query returns a row with Home_State and Age values for each unique
combination of Home_State and Age values:
SELECT DISTINCT Home_State,Age FROM Sample.Person
-
SELECT DISTINCT BY (item): Returns
one row for each unique combination of item values.
You can specify a single item or a comma-separated
list of items. The specified item or item list must be enclosed in parentheses.
Spaces may be specified or omitted between the BY keyword and the
parentheses. The select-item list may, but does
not have to, include the specified item(s). For
example, the following query returns a row with Name and Age values
for each unique combination of Home_State and Age values:
SELECT DISTINCT BY (Home_State,Age) Name,Age FROM Sample.Person
The item field(s) must be specified by column
name. Valid values include the following: a column name (DISTINCT BY (City)); an %ID (which returns all rows); a
scalar function specifying a column name (DISTINCT BY (ROUND(Age,-1))); a collation function specifying a column name (DISTINCT BY (%EXACT(City))). You cannot specify a field by column alias; attempting to do so
generates an SQLCODE -29 error. You cannot specify a field by column
number; this is interpreted as a literal and returns one row. Specifying
a literal as the item value in a DISTINCT clause returns 1 row; which row is returned is indeterminate. Thus,
specifying 7, 'Chicago', '', 0, or NULL all return 1 row. However,
if you specify a literal as an item value in a
comma-separated list, the literal is ignored and DISTINCT selects one arbitrary row for each unique combination of the specified
field names.
The DISTINCT clause is applied before the TOP clause. If both
are specified, the SELECT returns only rows with
unique values, the number of unique value rows specified in the TOP
clause.
If the column specified in the DISTINCT clause has rows that
are NULL (contain no value), DISTINCT returns one row with NULL as
a distinct (unique) value, as shown in the following examples:
SELECT DISTINCT FavoriteColors FROM Sample.Person
SELECT DISTINCT BY (FavoriteColors) Name,FavoriteColors FROM Sample.Person
ORDER BY FavoriteColors
A DISTINCT clause is not meaningful in an Embedded SQL simple query, because in this type of Embedded
SQL a SELECT always returns only one row of data.
However, an Embedded SQL cursor–based
query can return multiple rows of data; in a cursor-based
query, a DISTINCT clause returns only unique value rows.
DISTINCT and ORDER BY
The DISTINCT clause is applied before the ORDER BY clause. Therefore, the combination of DISTINCT and
ORDER BY will first select an arbitrary row that satisfies the DISTINCT
clause, then order those rows based on the ORDER BY clause.
DISTINCT and GROUP BY
DISTINCT and GROUP BY both group records by a specified field
(or fields) and return one record for each unique value of that field.
One significant difference between them is that DISTINCT calculates
aggregate functions before grouping. GROUP BY calculates aggregate
functions after grouping. This difference is shown in the following
examples:
SELECT DISTINCT BY (ROUND(Age,-1)) Age,AVG(Age) AS AvgAge FROM Sample.Person
/* AVG(Age) returns average of all ages in table */
SELECT Age,AVG(Age) AS AvgAge FROM Sample.Person GROUP BY ROUND(Age,-1)
/* AVG(Age) returns an average age for each age group */
A DISTINCT clause can be specified with one or more aggregate
function fields, though this is rarely meaningful because an aggregate
function returns a single value. Thus the following example returns
a single row:
SELECT DISTINCT BY (AVG(Age)) Name,Age,AVG(Age) FROM Sample.Person
Caution:
If a DISTINCT clause contains aggregate functions as the only item or select-item is used with a GROUP BY clause, the DISTINCT
clause is ignored. The intended combination of DISTINCT, aggregate
function, and GROUP BY can be achieved using a subquery. For further
details and program examples, refer to the GROUP BY clause reference page.
Letter Case and DISTINCT Optimization
DISTINCT groups together string values based
on the collation type defined
for the field. By default, string data type fields are defined with
SQLUPPER collation, which is not case-sensitive. The “Collation” chapter of Using InterSystems SQL provides
details on defining the string collation
default for the current namespace and specifying a non-default field collation type when
defining a field/property.
If the field/property collation type is SQLUPPER, grouped field
values are returned in all uppercase letters. To group values by original
letter case, or to display the returned values for a grouped field
in their original letter case, use the %EXACT collation function. This is shown in the following examples, which
assume that the Home_City field is defined with collation type SQLUPPER
and contains the values ‘New York’ and ‘new york’:
SELECT DISTINCT BY (Home_City) Name,Home_City FROM Sample.Person
/* groups together Home_City values by their uppercase letter values
returns the name of each grouped city in uppercase letters.
Thus, 'NEW YORK' is returned. */
SELECT DISTINCT BY (Home_City) Name,%EXACT(Home_City) FROM Sample.Person
/* groups together Home_City values by their uppercase letter values
returns the name of each grouped city in original letter case.
Thus, 'New York' or 'new york' may be returned, but not both. */
SELECT DISTINCT BY (%EXACT(Home_City)) Name,Home_City FROM Sample.Person
/* groups together Home_City values by their original letter case
returns the name of each grouped city in original letter case.
Thus, both 'New York' and 'new york' are returned.
Optimization is not used. */
You can optimize query performance for queries that contain
a DISTINCT clause by using the Management Portal. Select System Administration, Configuration, SQL and Object Settings, SQL. View and edit the GROUP BY and DISTINCT queries must
produce original values option. (This optimization also
works for the GROUP BY clause.)
The default is “No”.
This default groups alphabetic values by their uppercase letter
collation. This optimization takes advantage of indexes for the selected
field(s). It is therefore only meaningful if an index exists for one
or more of the selected fields. It collates field values as they are
stored in the index; alphabetic strings are returned in all uppercase
letters. You can set this system-wide option, then override it for
specific queries by using the %EXACT collation function to preserve
letter case.
For further details, refer to the SQL and Object Settings Pages listed in System Administration Guide.
You can also set this option system-wide using the $SYSTEM.SQL.Util.SetOption()Opens in a new tab method FastDistinct option. To determine the current setting, call $SYSTEM.SQL.CurrentSettings()Opens in a new tab, which displays the DISTINCT optimization turned on setting; the default is
1.
Other Uses of DISTINCT
-
Stream Field: DISTINCT operates on the OID of a stream
field, not its actual data. Because all stream field OIDs are unique
values, DISTINCT has no effect on actual stream field duplicate data
values. DISTINCT BY (StreamField) reduces the number
records where the stream field is NULL to one NULL record. For further
details, see Storing and Using Stream Data
(BLOBs and CLOBs).
-
Asterisk Syntax: The syntax DISTINCT * is legal, but not meaningful, because all rows, by definition, contain
some distinct unique identifier. The syntax DISTINCT BY (*) is not legal.
-
Subquery: The use of a DISTINCT clause in a subquery
is legal, but not meaningful, because a subquery returns a single
value.
-
No Row Data Selected: The DISTINCT clause can be used
with a SELECT that does not access any table data.
If the SELECT contains a FROM clause, specifying
DISTINCT results in one row contain these non-table values; if you
do not specify DISTINCT (or TOP) the SELECT results
in as many rows with identical values as the number of rows in the
FROM clause table. If the SELECT does not contain
a FROM clause, DISTINCT is legal but not meaningful. See FROM clause for more details.
-
Aggregate Function: A DISTINCT clause can be used
within an aggregate function to select only distinct (unique) field values for inclusion in the
aggregate. Unlike the SELECT DISTINCT clause, DISTINCT
within an aggregate function does not include NULL as a distinct (unique)
value. Note that the MAX and MIN aggregate functions parse DISTINCT clause syntax without error,
but this syntax performs no operation.
DISTINCT and %ROWID
Specifying the DISTINCT keyword causes a cursor-based Embedded SQL query to
not set the %ROWID variable.
%ROWID is not set even when DISTINCT does not limit the rows returned.
This is shown in the following example:
SET %ROWID=999
&sql(DECLARE EmpCursor CURSOR FOR
SELECT DISTINCT Name, Home_State
INTO :name,:state FROM Sample.Person
WHERE Home_State %STARTSWITH 'M')
&sql(OPEN EmpCursor)
QUIT:(SQLCODE'=0)
FOR { &sql(FETCH EmpCursor)
QUIT:SQLCODE
WRITE !,"RowID: ",%ROWID," row count: ",%ROWCOUNT
WRITE " Name=",name," State=",state
}
&sql(CLOSE EmpCursor)
This change of query behavior only applies to cursor-based Embedded
SQL SELECT queries. Dynamic SQL SELECT queries and non-cursor Embedded SQL SELECT queries
never set %ROWID.
DISTINCT and Transaction Processing
Specifying the DISTINCT keyword causes a query to retrieve all
current data, including data that has not yet been committed by the
current transaction. The transaction’s READ COMMITTED isolation
mode parameter (if set) is ignored; all data is retrieved in READ
UNCOMMITTED mode. For further details, refer to Transaction Processing in the
“Modifying the Database” chapter of Using InterSystems
SQL.
Examples
The following query returns one row for each distinct Home_State
value:
SELECT DISTINCT Home_State FROM Sample.Person
ORDER BY Home_State
The following query returns one row for each distinct Home_State
value, but returns additional fields for that row. The row that is
retrieved is not predictable:
SELECT DISTINCT BY (Home_State) %ID,Name,Home_State,Office_State FROM Sample.Person
ORDER BY Home_State
The following query returns one row for each distinct combination
of Home_State and Office_State values. Depending on the data, it will
either return more rows or the same number of rows as the previous
example:
SELECT DISTINCT BY (Home_State,Office_State) %ID,Name,Home_State,Office_State FROM Sample.Person
ORDER BY Home_State,Office_State
The following query uses DISTINCT BY to return one row for each
distinct Name length:
SELECT DISTINCT BY ($LENGTH(Name)) Name,$LENGTH(Name) AS lname
FROM Sample.Person
ORDER BY lname
The following query uses DISTINCT BY to return one row for each
distinct first element of FavoriteColors %List values. It lists one
distinct row with FavoriteColors NULL:
SELECT DISTINCT BY ($LIST(FavoriteColors,1)) Name,FavoriteColors,$LIST(FavoriteColors,1) AS FirstColor
FROM Sample.Person
The following query returns the first 20 distinct Home_State
values retrieved from Sample.Person in ascending collation sequence
order. The “top” rows reflect the ORDER BY clause sequencing
of all of the rows in Sample.Person.
SELECT DISTINCT TOP 20 Home_State FROM Sample.Person ORDER BY Home_State
The following query uses DISTINCT in both the main query and
in a WHERE clause subquery. It returns the first 20 distinct Home_State
values in Sample.Person that are also in Sample.Employee. If the subquery
DISTINCT was not provided, it would retrieve the distinct Home_State
values in Sample.Person that match a random selection of Home_State
values in Sample.Employee:
SELECT DISTINCT TOP 20 Home_State FROM Sample.Person
WHERE Home_State IN(SELECT DISTINCT TOP 20 Home_State FROM Sample.Employee)
ORDER BY Home_State
The following query returns the first 20 distinct FavoriteColor
values. This reflects the ORDER BY clause sequencing of all of the
rows in Sample.Person. The FavoriteColors field is known to have NULLs,
so one distinct row with FavoriteColors NULL appears at the top of
the collation sequence.
SELECT DISTINCT BY (FavoriteColors) TOP 20 FavoriteColors,Name FROM Sample.Person
ORDER BY FavoriteColors
Also note in the preceding example that because FavoriteColors
is a list field, the collation sequence includes the element length
byte. Thus distinct list values beginning with a three-letter element
(RED) are listed before list values beginning with a four-letter element
(BLUE).