sql-derivative-sensitivity-analyser_advanced

Here we present some additional functionality of the SQL combined sensitivity analyzer, which is not essential but allows doing more. We do not reproduce the formal semantics here, but give some examples instead.

The variable β is an optimization parameter that does not affect privacy at all, but is related to the noise level. In general, a smaller β means less noise, but not all values of β work for a particular query; an unsuitable value results in an error. The user may choose not to fix β and let the analyzer find a suitable β itself (which depends on ε). The analysis then takes more time, but the obtained β can be reused afterwards, and remains reasonable as long as the query and the data do not change too much.

A table T's norm is defined either in the text file `T.nrm` (if the analyzer is run from the command line), or in the text input window **table norm** that opens after clicking on a data object (if the analysis is run from the PLEAK web application).

The first two lines tell which rows (indexed starting from 0) and which columns (identified by the corresponding attribute name) are treated as sensitive. We estimate the change in the output when only the sensitive entries may change, while the non-sensitive entries remain the same.

```
rows: 0 3 7 ;
cols: latitude longitude ;
```

Here we assume that the columns `latitude` and `longitude` are sensitive in the rows indexed by 0, 3, and 7. It is possible to define more sophisticated sensitive components. For this, the norm description is extended by a sequence of variable assignments denoting how the norm is computed. The supported operations are scaling and l_{p}-norms, which can be composed in an arbitrary way.

```
rows: i_1 i_2 ... i_n ;
cols: attr_1 attr_2 ... attr_n ;
var_1 = op_1 var_11 ... var_1m ;
...
var_n = op_n var_n1 ... var_nm ;
return op var_n ;
```

As an example, let us consider the following norm definition.

```
rows: 0 3 7 ;
cols: latitude longitude ;
u = lp 2.0 latitude longitude;
z = scaleNorm 0.2 u;
return linf z;
```

The line `u = lp 2.0 latitude longitude;` combines latitude and longitude to define the Euclidean distance (i.e. the l_{2}-norm). We may scale the distances: the 0.2 in the line `z = scaleNorm 0.2 u;` means that we conceal changes in location of up to 1 / 0.2 = 5 units. Finally, `return linf z;` shows how the distance between the tables is computed from the distances between their rows, and `linf` means that we take the maximum row distance (i.e. the l_{∞}-norm), so DP conceals the change even if all sensitive rows change by a unit.
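To see what this norm measures, here is a small standalone sketch (ours, not the analyser's implementation) that computes the composed distance `linf(scaleNorm(0.2, lp(2.0, latitude, longitude)))` between two versions of the sensitive rows; the coordinate values are made up.

```python
# Illustrative sketch of the composed norm from the example above;
# this is not the analyser's code, and the coordinates are made up.

def row_distance(row_a, row_b):
    # u = lp 2.0 latitude longitude  (Euclidean distance between the rows)
    u = ((row_a[0] - row_b[0]) ** 2 + (row_a[1] - row_b[1]) ** 2) ** 0.5
    # z = scaleNorm 0.2 u  (changes of up to 1/0.2 = 5 units count as one unit)
    return 0.2 * u

def table_distance(rows_a, rows_b):
    # return linf z  (the table distance is the maximum row distance)
    return max(row_distance(a, b) for a, b in zip(rows_a, rows_b))

original = [(58.38, 26.72), (59.44, 24.75)]
modified = [(58.38, 26.72), (59.44, 29.75)]  # one longitude moved by 5 units
print(table_distance(original, modified))  # a 5-unit change yields distance 1.0
```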

In the previous section, we considered differential privacy w.r.t. changes in some particular cells of the data tables; the number of rows was considered immutable. Instead of (or in addition to) tracking changes in certain cells, we may also use the traditional definition of DP, which considers the addition or removal of a row as a unit change. For this, we need to define the cost of adding or removing a row, given by a variable G. This definition assumes that all rows are sensitive, so we need to mark `rows: all ;` to make it work properly.

```
rows: all ;
cols: none ;
G: 1 ;
```

It is possible to combine these two distances into one.

```
rows: all ;
cols: latitude longitude ;
G: 1.0 ;
```

Intuitively, this means that both types of changes are allowed. In this example, differential privacy conceals the fact that a row has been added or removed, as well as the fact that the latitude or longitude has been changed by a unit. More precisely, we define the distance between two tables as a *table edit distance* (analogous to string edit distance) that uses the following operations:

- the cost of row insertion/deletion (defined by the line `G:`);
- the cost of cell modification (defined by the line `cols:` and its possible extension).

Table edit distance is defined as the minimal cost of operations required to transform one table into the other.
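As a toy illustration, the following brute-force sketch computes such a distance for tiny tables. It is a simplification of the definition above: every row of the smaller table is matched to some row of the larger one (a full edit distance would also consider deleting a row and inserting its replacement when that is cheaper), and the cell-modification cost is taken to be the absolute difference summed over the sensitive columns. None of this is the analyser's actual implementation.

```python
# Illustrative brute-force sketch of a table edit distance;
# exponential, toy-sized tables only, not the analyser's algorithm.
from itertools import permutations

G = 1.0  # cost of inserting or deleting one row (the `G:` line)

def cell_cost(row_a, row_b):
    # cost of modifying the sensitive cells of a matched pair of rows
    return sum(abs(a - b) for a, b in zip(row_a, row_b))

def table_edit_distance(t1, t2):
    # brute force over all row matchings of the smaller table into the larger
    if len(t1) > len(t2):
        t1, t2 = t2, t1
    best = float("inf")
    for perm in permutations(range(len(t2)), len(t1)):
        cost = (len(t2) - len(t1)) * G  # unmatched rows count as insertions
        cost += sum(cell_cost(t1[i], t2[j]) for i, j in enumerate(perm))
        best = min(best, cost)
    return best

t1 = [(1.0, 2.0), (3.0, 4.0)]
t2 = [(1.0, 2.5), (3.0, 4.0), (9.0, 9.0)]
print(table_edit_distance(t1, t2))  # 0.5 (cell change) + 1.0 (one insertion) = 1.5
```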

Grouping by a table attribute (but not by a complex expression) is allowed. The analyzer expects that the group columns come first and that the aggregation is the last one. The currently supported aggregations are COUNT, SUM, MIN, and MAX, and only one aggregation per task is allowed.

```sql
SELECT table_i1.attr_1 AS group_1, ... table_ik.attr_k AS group_k, query_expression
FROM table_1 AS alias_1, ... table_n AS alias_n
WHERE condition_1 AND ... condition_n
GROUP BY table_i1.attr_1, ... table_ik.attr_k
```

Internally, a group query is converted into several instances of the same query, each equipped with a filter w.r.t. the corresponding group. Hence, the analyzer needs to know the set of possible group values in order to generate all these queries. If there are too many groups, the analysis will be slower, and it will also likely result in high noise. The groups can be specified under the tab *Table constraints*, as discussed in the next section.
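The rewriting can be pictured with a short sketch; the table name, attributes, and group values below are made up, and the actual analyser generates its queries internally.

```python
# Illustrative sketch of the GROUP BY rewriting described above;
# the table name, attributes, and group values are made up.
def expand_group_query(base_where, group_col, group_values):
    # one instance of the aggregation per possible group value
    return [
        f"SELECT COUNT(*) FROM ships AS s WHERE {base_where} AND s.{group_col} = '{v}'"
        for v in group_values
    ]

for q in expand_group_query("s.length > 100", "port", ["tallinn", "helsinki"]):
    print(q)
```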

Clicking on the sub-button *Table constraints* in the *Analysis settings* window opens a window tab where the user can define known bounds on the data, which may reduce the noise or even turn an infeasible result into a feasible one. These settings are essential for GROUP BY queries, as they define the total number of possible groups. For combined sensitivity analysis, they can be used to define sets and/or ranges of values.

```
table_1.attr_1 range lower_bound upper_bound ;
...
table_n.attr_n set value_1 ... value_n ;
```
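For instance, assuming a hypothetical table `ship` with a numeric attribute `latitude` and a categorical attribute `port`, the constraints could look as follows (the bounds and values are made up):

```
ship.latitude range -90 90 ;
ship.port set tallinn helsinki stockholm ;
```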

The attacker goal is given in the form of an SQL query. It defines a set of sensitive components that the attacker is trying to guess. For each sensitive attribute, the guess can either be `exact` (for discrete attributes) or `approx r` (approximated to within r > 0 units). It is possible to combine several attributes into a vector and define the approximation w.r.t. some l_p-norm as `approxWrtLp(p)` or `approxWrtLinf`. The filter `WHERE condition` describes which rows of the considered tables are treated as sensitive. The delimiter `;` finishes the description of the attacker goal.

```
SELECT (t.x, t.y) approxWrt(5) AND t.z exact FROM t WHERE t.b;
```

In this example, the attacker wins iff he guesses *both* `t.z` exactly and `(t.x,t.y)` within 5-unit precision w.r.t. the l_2-norm, for any row of the table `t` where `t.b` holds. The definition of a “unit” depends on the data table; e.g. if the location is given in miles, then a unit is also a mile.

If we want to express that the attacker wins if he guesses *either* `t.z` or `(t.x,t.y)`, we replace the AND operation with OR.

Compared to the combined sensitivity analyser, this tab has more use in guessing advantage analysis, allowing more types of constraints. The keywords `total`, `set`, and `range` do not specify any probability distribution on the data, and the analyser assumes the worst-case distribution by default (i.e. the one for which the advantage is the largest). The keywords `totalUnif`, `setUnif`, and `rangeUnif` in addition specify that the distribution is uniform. To provide even more distribution details, `setPrior` allows defining a probability for each element, and `rangePrior` allows splitting a range into n blocks, assigning a different weight to each block (inside each block, the elements are distributed uniformly).

```
table_1.attr_1 exact;                -- attacker knows the exact value
table_2.attr_2 total int;            -- there are n possible values
table_3.attr_3 set v1 ... vn;        -- the values are {v1 ... vn}
table_4.attr_4 range lb ub;          -- the values come from the range [lb,ub)
table_5.attr_5 totalUnif int;        -- there are n uniformly distributed values
table_6.attr_6 setUnif v1 ... vn;    -- uniformly distributed in {v1 ... vn}
table_7.attr_7 rangeUnif lb ub;      -- uniformly distributed in [lb,ub)
-- value vk comes with probability pk
table_8.attr_8 setPrior (v1, p1) ... (vn, pn)
-- range [v(k-1)...vk) comes with probability pk
-- the values within [v(k-1)...vk) are distributed uniformly
table_9.attr_9 rangePrior v0 (v1, p1) ... (vn, pn)
```
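As an illustration of what `setUnif` and `setPrior` tell the analyser, the sketch below builds the corresponding priors for a single discrete attribute and picks the attacker's best single exact guess under a `setPrior`; the values and probabilities are made up, and this is not the analyser's advantage computation.

```python
# Illustrative priors for one discrete attribute; the values are made up.
def set_unif(values):
    # setUnif v1 ... vn: each value is equally likely
    return {v: 1.0 / len(values) for v in values}

def set_prior(pairs):
    # setPrior (v1, p1) ... (vn, pn): explicit probability per value
    prior = dict(pairs)
    assert abs(sum(prior.values()) - 1.0) < 1e-9, "probabilities must sum to 1"
    return prior

prior = set_prior([("tallinn", 0.7), ("helsinki", 0.2), ("stockholm", 0.1)])
# for an `exact` guess, the attacker's best single guess is the most likely value
best_guess = max(prior, key=prior.get)
print(best_guess, prior[best_guess])  # tallinn 0.7
```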

sql-derivative-sensitivity-analyser_advanced.txt · Last modified: 2019/10/01 13:53 (external edit)