
Advanced settings

Here we present some additional functionality of the SQL combined sensitivity analyzer, which is not essential but allows doing more. We do not reproduce the formal semantics here, but give some examples instead.

Sensitivity analysis

Smoothness beta

The variable β is an optimization parameter that does not affect privacy at all, but is related to the noise level. In general, a smaller β means less noise, but not all values of β work for a particular query; an unsuitable value results in an error. The user may choose not to fix β and let the analyzer find a suitable value itself (which depends on ε). The analysis then takes more time, but the obtained β can be reused afterwards and remains reasonable as long as the query or the data does not change too much.

Defining norms for data tables

A table T's norm is defined either in the text file T.nrm (if the analyzer is run from the command line), or in the text input window table norm that opens after clicking on a data object (if the analysis is run from the PLEAK web application).

The first two lines tell which rows (indexed starting from 0) and which columns (identified by the corresponding attribute name) are treated as sensitive. We estimate the change in the output when only the sensitive entries may change and the non-sensitive entries remain the same.

rows: 0 3 7 ;
cols: latitude longitude ;

Here we assume that the columns latitude and longitude are sensitive in the rows indexed by 0, 3, and 7. It is possible to define more sophisticated sensitive components. For this, the norm description is extended with a sequence of variable assignments describing how the norm is computed. The supported operations are scaling and l_p-norms, which can be composed in an arbitrary way.

rows: i_1 i_2 ... i_n ;
cols: attr_1 attr_2 ... attr_n ;

var_1 = op_1 var_11 ... var_1m;
....
var_n = op_n var_n1 ... var_nm;
return op var_n;

As an example, let us consider the following norm definition.

rows: 0 3 7 ;
cols: latitude longitude ;

u = lp 2.0 latitude longitude;
z = scaleNorm 0.2 u;
return linf z;

The line u = lp 2.0 latitude longitude; combines latitude and longitude to define the Euclidean distance (i.e. the l_2-norm). We may scale the distances: the 0.2 in the line z = scaleNorm 0.2 u; means that we conceal changes in location of up to 1 / 0.2 = 5 units. Finally, return linf z; shows how the distance between the tables is computed from the distances between their rows; linf means that we take the maximum row distance (i.e. the l_∞-norm), so DP conceals the change even if all sensitive rows change by a unit.
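
As a small hypothetical illustration of this composition, suppose that in row 3 the latitude changes by 3 units and the longitude by 4 units, while all other sensitive entries stay the same. The row and table distances are then computed as follows:

u (row 3)       =  sqrt(3^2 + 4^2)           =  5
z (row 3)       =  0.2 * 5                   =  1
table distance  =  linf over sensitive rows  =  1

Since the table distance is 1, differential privacy conceals this change, in line with the claim that location changes of up to 5 units are concealed.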

Local and combined sensitivity

In the previous section, we considered differential privacy w.r.t. changes in some particular cells of the data tables; the number of rows was considered immutable. Instead of (or in addition to) tracking changes in certain cells, we may also use the traditional definition of DP, which considers the addition or removal of a row as a unit change. For this, we need to define the cost of adding or removing a row, given by the variable G. This definition assumes that all rows are sensitive, so we need to write rows: all ; to make it work properly.

rows: all ;
cols: none ;
G: 1 ;

It is possible to combine these two distances into one.

rows: all ;
cols: latitude longitude ;
G: 1.0 ;

Intuitively, this means that both types of changes are allowed. In this example, differential privacy conceals the fact that a row has been added or removed, as well as the fact that the latitude or longitude has been changed by a unit. More precisely, we define the distance between two tables as a table edit distance (analogous to string edit distance) that uses the following operations:

  • the cost of row insertion/deletion (defined by the line G:).
  • the cost of cell modification (defined by the line cols: and the possible extension).

The table edit distance is defined as the minimal total cost of operations required to transform one table into the other.
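
As a small hypothetical illustration, suppose one table is obtained from the other by deleting one row and by changing the location in another row so that the change amounts to distance 0.5 under the chosen column norm. The table edit distance is then the total cost of these operations:

delete one row                  :  cost G      =  1.0
modify one row by distance 0.5  :  cost        =  0.5
table edit distance             =  1.0 + 0.5   =  1.5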

Group queries

Grouping by a table attribute (but not by a complex expression) is allowed. The analyzer expects that the group columns come first and the aggregation comes last. The currently supported aggregations are COUNT, SUM, MIN, and MAX, and only one aggregation per task is allowed.

SELECT
    table_i1.attr_1 AS group_1,
    ...
    table_ik.attr_k AS group_k,
    query_expression
FROM
    table_1 AS alias_1,
    ...
    table_n AS alias_n
WHERE
    condition_1 AND
    ...
    condition_n
GROUP BY
    table_i1.attr_1,
    ...
    table_ik.attr_k

Internally, a group query is converted into several instances of the same query, each equipped with a filter w.r.t. the corresponding group. Hence, the analyzer needs to know the set of possible group values to generate all these queries. If there are too many groups, the analysis will be slower, and it will also likely result in high noise. The groups can be specified under the tab Table constraints, as discussed in the next section.
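
As an illustration (the table and attribute names here are hypothetical), a group query counting ships per port could look as follows; internally, the analyzer would turn it into one counting query per possible value of p.name.

SELECT
    p.name AS port_name,
    COUNT(s.id)
FROM
    port AS p,
    ship AS s
WHERE
    s.port_id = p.id
GROUP BY
    p.name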

Table constraints

Clicking on the sub-button Table constraints in the Analysis settings window opens a tab where the user can define known bounds on the data, which may reduce the noise or even turn an infeasible result into a feasible one. These settings are essential for GROUP BY queries, as they define the total number of possible groups. For the combined sensitivity analysis, they can be used to define sets and/or ranges of values.

table_1.attr_1 range lower_bound upper_bound ;
...
table_n.attr_n set value_1 ... value_n ;
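
Continuing the hypothetical group query above, the constraints could enumerate the possible port names (so that the analyzer knows all group values) and bound a numeric attribute; the names and values below are illustrative only.

port.name set tallinn helsinki stockholm ;
ship.cargo range 0 1000 ;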

Guessing advantage analyzer

Attacker goal

The attacker goal is given in the form of an SQL query. It defines the set of sensitive components that the attacker is trying to guess. For each sensitive attribute, the guess can either be exact (for discrete attributes) or approx r (approximated to within r > 0 units). It is possible to combine several attributes into a vector and define the approximation w.r.t. some l_p-norm as approxWrtLp(p) or approxWrtLinf. The filter WHERE condition describes which rows of the considered tables are treated as sensitive. The delimiter ; finishes the description of the attacker goal.

SELECT
(t.x, t.y) approxWrt(5) AND
t.z exact
FROM t
WHERE t.b;

In this example, the attacker wins iff he guesses both t.z exactly and (t.x, t.y) to within 5-unit precision w.r.t. the l_2-norm for any row of the table t where t.b holds. The definition of a “unit” depends on the data table: e.g. if the location is given in miles, then a unit is also a mile.

If we want to express that the attacker wins if he guesses either t.z or (t.x, t.y), we replace the AND operation with OR.
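
For instance, rewriting the goal above in this way, the attacker now wins by guessing either component:

SELECT
(t.x, t.y) approxWrt(5) OR
t.z exact
FROM t
WHERE t.b;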

Table constraints

Compared to the combined sensitivity analysis, this tab has more use in the guessing advantage analysis, allowing more types of constraints. The keywords total, set and range do not specify any probability distribution on the data, and the analyzer assumes the worst-case distribution by default (i.e. the one for which the advantage is the largest). The keywords totalUnif, setUnif, rangeUnif in addition specify that the distribution is uniform. To provide even more distribution details, setPrior allows defining a probability for each element, and rangePrior allows splitting a range into n blocks, assigning a different weight to each block (inside each block, the elements are distributed uniformly).

table_1.attr_1 exact;               -- attacker knows the exact value

table_2.attr_2 total int;           -- there are n possible values
table_3.attr_3 set v1 ... vn;       -- there are values {v1 ... vn}
table_4.attr_4 range lb ub;         -- the values come from range [lb,ub)

table_5.attr_5 totalUnif int;       -- there are n uniformly distributed values
table_6.attr_6 setUnif v1 ... vn;   -- uniformly distributed in {v1 ... vn}
table_7.attr_7 rangeUnif lb ub;     -- uniformly distributed in [lb,ub)

-- value vk comes with probability pk
table_8.attr_8 setPrior (v1, p1) ... (vn, pn)  

-- range [v(k-1)...vk) comes with prob.pk
-- the values within [v(k-1)...vk) are distributed uniformly
table_9.attr_9 rangePrior v0 (v1, p1) ... (vn, pn) 
                                     