Table of Contents

Advanced settings

In this page we describe how to specify different parameters and inputs of SQL combined sensitivity analyzer. We do not reproduce the formal semantics here, but give some examples instead.

Sensitvity analysis

Assume that we are running the following query from the analyzer command line

dist/build/banach/banach -QDs --db-create-tables demo_schema.sql demo_query.sql demo_constraints.att --epsilon 1.0 --beta 0.1 --errorUB 0.9 --delimiter ' '

Based on this example, we describe the files and the numeric inputs used by the analyzer.

Schemas

The first compulsory parameter specifies the file containing the schema, which is demo_schema.sql in the example. This file contains the contents of all the windows Table schema and Output table schema if the analysis is run from PLEAK web application.

In this file are listed schemas of all input tables, and optionally, also the output tables.

CREATE TABLE t1 (var_11 type_11,  ...  var_1n type_1n);
...
CREATE TABLE tm (var_m1 type_m1,  ...  var_mn type_mn);

Queries

The second compulsory parameter specifies the file containing the SQL workflow, represented by a list of sequential queries, which is demo_query.sql in the example. This file contains the contents of all the windows Output table query if the analysis is run from PLEAK web application.

The intermediate queries are plain SELECT-queries. If the analyzer is executed from the command line, then INSERT INTO syntax is used to specify the name of the corresponding output table. INSERT INTO syntax is not needed if the analysis is run from PLEAK web application.

INSERT INTO output_table
SELECT
    alias_1.attr_1
    ...
    alias_n.attr_k
FROM
    input_table_1 AS alias_1,
    ...
    input_table_n AS alias_n
WHERE
    condition_1 AND
    ...
    condition_n
;

The last query should be an aggregation. Grouping by a table attribute (but not a complex expression) is allowed. The analyzer expects that the group columns come first, and the aggregation is the last one. Currently supported aggregations are COUNT, SUM, MIN, MAX, and only one aggregation per task is allowed. In general, a query may have the following syntax.

SELECT
    table_i1.attr_1 AS group_1,
    ...
    table_ik.attr_k AS group_k,
    query_expression
FROM
    table_1 AS alias_1,
    ...
    table_n AS alias_n
WHERE
    condition_1 AND
    ...
    condition_n
GROUP BY
    table_i1.attr_1,
    ...
    table_ik.attr_k
;

Internally, a group query is converted into several instances of the same query equipped with a filter w.r.t. corresponding group. Hence, the analyzer needs to know the set of possible group values to generate all these queries. If there are too many groups, the analysis will be slower, and may result in high noise level. The groups can be specified in Table constrains as discussed in the next section.

Table constraints

The third compulsory parameter specifies the file containing constraints on table attributes, which is demo_constraints.sql in the example. This can be opened by clicking on a sub-button Table constraints in the Analysis settings window if the analysis is run from PLEAK web application.

In this file, the user can define known bounds on data, which may reduce the noise or even turn an infeasible result into feasible. These settings are essential for GROUP BY queries, defining the total number of possible groups. One can define sets and/or ranges of values.

table_1.attr_1 range lower_bound upper_bound ;
...
table_n.attr_n set value_1 ... value_n ;

Epsilon

The parameter ε (–epsilon) determines the level of differential privacy. Simply put, smaller ε means more privacy. While ε > 0, there does not exist a general upper bound on ε.

Smoothness beta

The parameter β (–beta) is an optimization parameter that does not affect privacy at all, but is related to the noise level. In general, smaller β means less noise, but not all values of β will work for a particular query, resulting in an error. The user may choose not to fix β and let the analyzer find a suitable β itself (which depends on ε). The optimization takes more time, but the obtained β can be reused afterwards, and still be reasonable if the query or the data does not change too much. For the particular noise generation mechanism that the sensitivity analyzer is using, we have 0 ≤ β < ε / 5.

Confidence in reported error

The analyzer computes the amount of noise required to achieve ε-differential privacy for the given query. As the noise comes from Laplace or Cauchy distribution, there is no upper bound on the additive error. However, we can define a probabilistic upper bound under which the error stays with certain probability, specified by the parameter –errorUB. If analysis is run from PLEAK web application, this value is specified by the slider Confidence level of estimated noise in the Analysis settings window.

Defining data of the input tables

If analysis is run from PLEAK web application, then the data is simply inserted into the provided table of the window tab table data that opens after clicking on a data object. If the analysis is run from command line, then a table T's data is defined in the text file T.db. The parameter –delimiter specifies the special character used to separate the values, similarly to .csv format. In our example, the delimiter is a whitespace , which is a default delimiter. In both cases, the first row contains the column names of the corresponding attributes.

var_1 var_2 var_3
x11 x21 x31
x12 x22 x32
x13 x23 x33

Defining norms of the input tables

A table T's norm is defined in the text file T.nrm if the analysis is run from command line, or in the text input window table norm that opens after clicking on a data object if analysis is run from PLEAK web application.

The first two lines tell which rows (indexed starting from 0) and which columns (identified by corresponding attribute name) are treated as sensitive. That is, we define differential privacy w.r.t. change in particular table cells.

rows: 0 3 7 ;
cols: latitude longitude ;

In this example, we assume that the columns latitude and longitude are sensitive in the rows indexed by 0, 3, and 7. It is possible to define more sophisticated sensitive components. For this, the norm description is extended by a sequence of variable assignments, denoting how the norm is computed from the attributes of input tables. Supported operations are scaling scaleNorm a and lp-norms lp p and linf, where a > 0 and p > 1 are constants.

rows: i_1 i_2 ... i_n ;
cols: attr_1 attr_2 ... attr_n ;

var_1 = op_1 var_11 ... var_1m;
....
var_n = op_n var_n1 ... var_nm;
return op var_n;

As an example, let us consider the following norm definition.

rows: 0 3 7 ;
cols: latitude longitude ;

u = lp 2.0 latitude longitude;
z = scaleNorm 0.2 u;
return linf z;

The line u = lp 2.0 latitude longitude; combines latitude and longitude to define Euclidean distance (i.e l2-norm). We may scale the distances, and 0.2 in the line z = scaleNorm 0.2 u; means that we conceal changes in location up to 1 / 0.2 = 5 units. Finally, return linf z; shows how the distance between the tables is computed from the distances between their rows, and linf means that we take the maximum row distance (i.e l-norm), so differential privacy should cover the case where the locations of all sensitive rows change by 5 units.

Local and combined sensitivity

In pure derivative sensitivity, we considered differential privacy w.r.t. change in some particular cells of the data tables. The number of rows was considered immutable. Instead of (or in addition to) tracking changes in certain cells, we may also use traditional definition of DP, considering addition and removal of a row as a unit change. For this, we need to define the cost of adding or removing a row, defined by a variable G. It assumes that all rows are sensitive, so we need to mark rows: all ; to make it work properly.

rows: all ;
cols: none ;
G: 1 ;

It is possible to combine these two distances into one.

rows: all ;
cols: latitude longitude ;
G: 1.0 ;

Intuitively, this means that both types of changes are allowed. In this example, differential privacy conceals the facts that a row has been added or removed, as well as that the latitude or longitude has been changed by a unit. More precisely, we define the distance between two tables as a table edit distance (analogous to string edit distance) that uses the following operations:

Table edit distance is defined as the minimal cost of operations required to transform one table into the other.

Analysis output

In the end, the sensitivity analyzer executed from command line reports the following:

Cauchy noise distribution:  add noise 'Cauchy  noise'*z, where z ~ sqrt(2) / pi * 1 / (1 + |x|^4)
Laplace noise distribution: add noise 'Laplace noise'*z, where z ~ 1 / 2 * exp(-x)
Task: output
-                             sensitivity     query_result     cauchy_noise     cauchy_relative_err     beta         delta       laplace_noise     laplace_relative_err     norm                                                                                  
all input tables together     0.166667        3.66             2.29             62.694603               0.100000     9.08e-5     Infinity          Infinity                 an l_1-norm of all input table norms                                                  
ship                          0.166667        3.66             2.29             62.694603               0.100000     9.08e-5     Infinity          Infinity                 || 0.2* (|| "latitude","longitude" ||_2.0),0.1* (|| "cargo" ||_1.0) ||_1.0, rows l_inf

where:

Relative error is defined as the quotient of the noise (additive error) and the true query output. If there are several outputs, then the relative error is the quotient of l2-norms of the additive error vector and the output vector.

Guessing advantage analyzer

Assume that we are running the following query from the analyzer command line

dist/build/banach/banach -QDsp --db-create-tables demo_schema.sql demo_query.sql demo_constraints.att --policy=demo_attacker_goal.sql --epsilon 0.3

Based on this example, we describe those files and the numeric inputs that are different from the sensitivity analysis described in the previous section.

Attacker goal

Attacker goal is specified in the file given as an optional parameter –policy (in our example demo_attacker_goal.att ), or in the window tab that opens on clicking Attacker goal under Analysis settings window if the analysis is run from PLEAK web application.

Attacker goal is given in form of SQL query with extended syntax. It defines a set of sensitive components, which the attacker is trying to guess. For each sensitive attribute, the guess can either be exact (discrete attributes), or approx r (approximated by r > 0 units). It is possible to combine several attributes into a vector and define approximation w.r.t. some l_p-norm as approxWrtLp(p) and approxWrtLinf. The filter WHERE condition describes which rows of the considered tables are treated as sensitive. The delimiter ; finishes the description of the attacker goal.

SELECT
(t.x, t.y) approxWrt(5) AND
t.z exact
FROM t
WHERE b;

In this example, the attacker wins iff he guesses both t.z exactly and (t.x,t.y) within 5-unit precision w.r.t. l_2-norm of any row of the table t where t.b holds. The definition of “unit” depends on the data table, e.g. if the location was defined in miles, then a unit is also a mile.

If we want to express that the attacker wins if he guesses either t.z or (t.x,t.y), we replace AND operation with OR.

Table constraints

Compared to combined sensitivity analyser, this tab has more use in guessing advantage analysis, allowing more types of constraints. The keywords total, set and range do not specify any probability distribution on the data, and the analyser assumes worst-case distribution by default (i.e. one for which the advantage is the largest). The keywords totalUnif, setunif, rangeUnif in addition specify that the distribution is uniform. To provide even more distribution details, setPrior allows to define a probability for each element, and rangePrior allows to split a range into n blocks, defining a different weight to each block (inside each block, the elements are distributed uniformly). The analyser also supports normal distribution.

table.attr exact;               --attacker knows the exact value

table.attr total int;           --there are n possible values
table.attr set v1 ... vn;       --there are values {v1 ... vn}
table.attr range lb ub;         --the values come from range [lb,ub)

table.attr totalUnif int;       -- there are n uniformly distributed values
table.attr setUnif v1 ... vn;   -- uniformly distributed in {v1 ... vn}
table.attr rangeUnif lb ub;     -- uniformly distributed in [lb,ub)

-- value vk comes with probability pk
table.attr setPrior (v1, p1) ... (vn, pn)  

-- range [v(k-1)...vk) comes with prob.pk
-- the values within [v(k-1)...vk) are distributed uniformly
table.attr rangePrior v0 (v1, p1) ... (vn, pn) 

-- normal distribution with predefined mean and variance
-- we have μ=mean, σ=sqrt(variance)
table.attr normal mean variance
                                     

Epsilon

Differently from ε of differential privacy as in sensitivity analysis, the parameter –epsilon now defines the desired upper bound on attacker's advantage in guessing / approximating the values specified in demo_attacker_goal.att, using the knowledge defined in demo_constraints.att. This quantity is defined by the slider Attacker advantage if the analysis is run from PLEAK web application.

Analysis output

In the end, the guessing advantage analyzer executed from command line reports the following:

The true query output(s) without noise.

actual outputs y: [3.655]

The additive and the relative errors. The number 78% is the confidence level, stated in the parameter –errorUB, which is 0.78 by default.

78%-noise magnitude a: [114.226]
78%-realtive error |a|/|y|: 3124.968%

How the noise should actually be sampled.

Cauchy (default) distribution: add noise a*z, where z ~ sqrt(2) / pi * 1 / (1 + |x|^4)

Particular prior and posterior probabilities (worst instance among the analyzed table rows).

prior (worst instance): 35.0%
posterior (worst instance): 65.0%

The parameters for conversion between differential privacy and guessing advantage.

DP epsilon: 7.0e-3
smoothness beta: 0.0
(epsilon,delta) for Laplace: (7.0e-3,0.0)

What is the underlying norm w.r.t which DP should be ensured with the previous parameters.

norm N: || 0.2* (|| "ship.longitude","ship.latitude" ||_2.0) ||_inf
beta-smooth sensitivity w.r.t. N: [0.167]

Noise level estimate for Laplace mechanism.

78%-noise magnitude a (Laplace): [34.614]
78%-realtive error |a|/|y| (Laplace): 946.96%
Laplace noise distribution: add noise a*z, where z ~ 1 / 2 * exp(-x)