Aggregation

Introduction

Generally an aggregation aggr(expr) processes all matching rows for each aggregation key found in an incoming record (keys are compared using equivalence).

In a regular aggregation (i.e. of the form aggr(expr)), the list of aggregated values is the list of candidate values with all null values removed from it.

Data Setup

SELECT * FROM cypher('graph_name', $$
	CREATE (a:Person {name: 'A', age: 13}),
	(b:Person {name: 'B', age: 33, eyes: "blue"}),
	(c:Person {name: 'C', age: 44, eyes: "blue"}),
	(d1:Person {name: 'D', eyes: "brown"}),
	(d2:Person {name: 'D'}),
	(a)-[:KNOWS]->(b),
	(a)-[:KNOWS]->(c),
	(a)-[:KNOWS]->(d1),
	(b)-[:KNOWS]->(d2),
	(c)-[:KNOWS]->(d2)
$$) as (a agtype);

Auto Group By

To calculate aggregated data, Cypher offers aggregation, analogous to SQL’s GROUP BY.

Aggregating functions take a set of values and calculate An aggregated value over them. Examples are avg() that calculates the average of multiple numeric values, or min() that finds the smallest numeric or string value in a set of values. When we say below that an aggregating function operates on a set of values, we mean these to be the result of the application of the inner expression(such as n.age) to all the records within the same aggregation group.

Aggregation can be computed over all the matching subgraphs, or it can be further divided by introducing grouping keys. These are non-aggregate expressions, that are used to group the values going into the aggregate functions.

Assume we have the following return statement:

SELECT * FROM cypher('graph_name', $$
	MATCH (v:Person)
	RETURN v.name, count(*)
$$) as (grouping_key agtype, count agtype);

count	key
"A"	1
"B"	1
"C"	1
"D"	2
1 row

We have two return expressions: grouping_key, and count(*). The first, grouping_key, is not an aggregate function, and so it will be the grouping key. The latter, count(*) is an aggregate expression. The matching subgraphs will be divided into different buckets, depending on the grouping key. The aggregate function will then be run on these buckets, calculating an aggregate value per bucket.

Sorting on aggregate functions

To use aggregations to sort the result set, the aggregation must be included in the RETURN to be used in the ORDER BY.

SELECT *
FROM cypher('graph_name', $$
	MATCH (me:Person)-[]->(friend:Person)
	RETURN count(friend), me
	ORDER BY count(friend)
$$) as (friends agtype, me agtype);

Distinct aggregation

In a distinct aggregation (i.e. of the form aggr(DISTINCT expr)), the list of aggregated values is the list of candidate values with all null values removed from it. Furthermore, in a distinct aggregation, only one of all equivalent candidate values is included in the list of aggregated values, i.e. duplicates under equivalence are removed.

The DISTINCT operator works in conjunction with aggregation. It is used to make all values unique before running them through an aggregate function.

SELECT *
FROM cypher('graph_name', $$
	MATCH (v:Person)
	RETURN count(DISTINCT v.eyes), count(v.eyes)
$$) as (distinct_eyes agtype, eyes agtype);

distinct_eyes	eyes
2	3
1 row

Ambiguous Grouping Statements

This feature of not requiring the user to specify their grouping keys for a query allows for ambiguity on what Cypher should qualify as their grouping keys. For more details click here.

Data Setup

SELECT * FROM cypher('graph_name', $$
CREATE (:L {a: 1, b: 2, c: 3}),
       (:L {a: 2, b: 3, c: 1}),
       (:L {a: 3, b: 1, c: 2})
$$) as (a agtype);

Invalid Query in AGE

AGE’s solution to this problem is to not allow a WITH or RETURN column to combine aggregate functions with variables that are not explicitly listed in another column of the same WITH or RETURN clause.

Query:

SELECT * FROM cypher('graph_name', $$
	MATCH (x:L)
	RETURN x.a + count(*) + x.b + count(*) + x.c
$$) as (a agtype);

Result:

ERROR:  "x" must be either part of an explicitly listed key or used inside an aggregate function
LINE 3: RETURN x.a + count(*) + x.b + count(*) + x.c

Valid Query in AGE

Columns that do not include an aggregate function in AGE are considered to be the grouping keys for that WITH or RETURN clause.

For the above query, the user could rewrite the query is several ways that will return results

Query:

SELECT * FROM cypher('graph_name', $$
	MATCH (x:L)
	RETURN (x.a + x.b + x.c) + count(*) + count(*), x.a + x.b + x.c
$$) as (count agtype, key agtype);

x.a + x.b + x.c is the grouping key. Grouping keys created like this must include parenthesis.

Results

count	key
12	6
1 row

Query

SELECT * FROM cypher('graph_name', $$
	MATCH (x:L)
	RETURN x.a + count(*) + x.b + count(*) + x.c, x.a, x.b, x.c
$$) as (count agtype, a agtype, b agtype, c agtype);

x.a, x.b, and x.c will be considered different grouping keys

Results:

abc

count
8	3	1	2
8	2	3	1
8	1	2	3
3 rows

Vertices and edges in ambiguous grouping

Alternatively, the grouping key can be a vertex or edge, and then any properties of the vertex or edge can be specified without being explicitly stated in a WITH or RETURN column.

SELECT * FROM cypher('graph_name', $$
	MATCH (x:L)
	RETURN count(*) + count(*) + x.a + x.b + x.c, x
$$) as (count agtype, key agtype);

Results will be grouped on x, because it is safe to assume that properties be considered unecessary for grouping to be unambiguous.

Results

key

count
8	{"id": 1407374883553283, "label": "L", "properties": {"a": 3, "b": 1, "c": 2}}::vertex
8	{"id": 1407374883553281, "label": "L", "properties": {"a": 1, "b": 2, "c": 3}}::vertex
8	{"id": 1407374883553282, "label": "L", "properties": {"a": 2, "b": 3, "c": 1}}::vertex
3 rows

Hiding unwanted grouping keys

If the grouping key is considered unecessary for the query output, the aggregation can be done in a WITH clause then passing information to the RETURN clause.

SELECT * FROM cypher('graph_name', $$
	MATCH (x:L)
	WITH count(*) + count(*) + x.a + x.b + x.c as column, x
	RETURN column
$$) as (a agtype);

Results

a
8
8
8
3 rows