sql-derivative-sensitivity-analyser_advanced

In this page we describe how to specify different parameters and inputs of SQL combined sensitivity analyzer. We do not reproduce the formal semantics here, but give some examples instead.

Assume that we are running the following query from the analyzer command line

dist/build/banach/banach -QDs --db-create-tables demo_schema.sql demo_query.sql demo_constraints.att --epsilon 1.0 --beta 0.1 --errorUB 0.9 --delimiter ' '

Based on this example, we describe the files and the numeric inputs used by the analyzer.

The first compulsory parameter specifies the file containing the schema, which is `demo_schema.sql`

in the example. This file contains the contents of all the windows **Table schema** and **Output table schema** if the analysis is run from PLEAK web application.

In this file are listed schemas of all input tables, and optionally, also the output tables.

CREATE TABLE t1 (var_11 type_11, ... var_1n type_1n); ... CREATE TABLE tm (var_m1 type_m1, ... var_mn type_mn);

The second compulsory parameter specifies the file containing the SQL workflow, represented by a list of sequential queries, which is `demo_query.sql`

in the example. This file contains the contents of all the windows **Output table query** if the analysis is run from PLEAK web application.

The intermediate queries are plain SELECT-queries. If the analyzer is executed from the command line, then INSERT INTO syntax is used to specify the name of the corresponding output table. INSERT INTO syntax is not needed if the analysis is run from PLEAK web application.

INSERT INTO output_table SELECT alias_1.attr_1 ... alias_n.attr_k FROM input_table_1 AS alias_1, ... input_table_n AS alias_n WHERE condition_1 AND ... condition_n ;

The last query should be an aggregation. Grouping by a table attribute (but not a complex expression) is allowed. The analyzer expects that the group columns come first, and the aggregation is the last one. Currently supported aggregations are COUNT, SUM, MIN, MAX, and only one aggregation per task is allowed. In general, a query may have the following syntax.

SELECT table_i1.attr_1 AS group_1, ... table_ik.attr_k AS group_k, query_expression FROM table_1 AS alias_1, ... table_n AS alias_n WHERE condition_1 AND ... condition_n GROUP BY table_i1.attr_1, ... table_ik.attr_k ;

Internally, a group query is converted into several instances of the same query equipped with a filter w.r.t. corresponding group. Hence, the analyzer needs to know the set of possible group values to generate all these queries. If there are too many groups, the analysis will be slower, and may result in high noise level. The groups can be specified in *Table constrains* as discussed in the next section.

The third compulsory parameter specifies the file containing constraints on table attributes, which is `demo_constraints.sql`

in the example. This can be opened by clicking on a sub-button *Table constraints* in the *Analysis settings* window if the analysis is run from PLEAK web application.

In this file, the user can define known bounds on data, which may reduce the noise or even turn an infeasible result into feasible. These settings are essential for GROUP BY queries, defining the total number of possible groups. One can define sets and/or ranges of values.

table_1.attr_1 range lower_bound upper_bound ; ... table_n.attr_n set value_1 ... value_n ;

The parameter ε (`–epsilon`

) determines the level of differential privacy. Simply put, smaller ε means more privacy. While ε > 0, there does not exist a general upper bound on ε.

The parameter β (`–beta`

) is an optimization parameter that does not affect privacy at all, but is related to the noise level. In general, smaller β means less noise, but not all values of β will work for a particular query, resulting in an error. The user may choose not to fix β and let the analyzer find a suitable β itself (which depends on ε). The optimization takes more time, but the obtained β can be reused afterwards, and still be reasonable if the query or the data does not change too much. For the particular noise generation mechanism that the sensitivity analyzer is using, we have 0 ≤ β < ε / 5.

The analyzer computes the amount of noise required to achieve ε-differential privacy for the given query. As the noise comes from Laplace or Cauchy distribution, there is no upper bound on the additive error. However, we can define a *probabilistic upper bound* under which the error stays with certain probability, specified by the parameter `–errorUB`

. If analysis is run from PLEAK web application, this value is specified by the slider *Confidence level of estimated noise* in the *Analysis settings* window.

If analysis is run from PLEAK web application, then the data is simply inserted into the provided table of the window tab **table data** that opens after clicking on a data object. If the analysis is run from command line, then a table T's data is defined in the text file `T.db`

. The parameter `–delimiter`

specifies the special character used to separate the values, similarly to `.csv`

format. In our example, the delimiter is a whitespace ` `

, which is a default delimiter. In both cases, the first row contains the column names of the corresponding attributes.

var_1 var_2 var_3 x11 x21 x31 x12 x22 x32 x13 x23 x33

A table T's norm is defined in the text file `T.nrm`

if the analysis is run from command line, or in the text input window **table norm** that opens after clicking on a data object if analysis is run from PLEAK web application.

The first two lines tell which rows (indexed starting from 0) and which columns (identified by corresponding attribute name) are treated as sensitive. That is, we define differential privacy w.r.t. change in particular table cells.

rows: 0 3 7 ; cols: latitude longitude ;

In this example, we assume that the columns `latitude`

and `longitude`

are sensitive in the rows indexed by 0, 3, and 7. It is possible to define more sophisticated sensitive components. For this, the norm description is extended by a sequence of variable assignments, denoting how the norm is computed from the attributes of input tables. Supported operations are scaling `scaleNorm a`

and l_{p}-norms `lp p`

and `linf`

, where a > 0 and p > 1 are constants.

rows: i_1 i_2 ... i_n ; cols: attr_1 attr_2 ... attr_n ; var_1 = op_1 var_11 ... var_1m; .... var_n = op_n var_n1 ... var_nm; return op var_n;

As an example, let us consider the following norm definition.

rows: 0 3 7 ; cols: latitude longitude ; u = lp 2.0 latitude longitude; z = scaleNorm 0.2 u; return linf z;

The line `u = lp 2.0 latitude longitude;`

combines latitude and longitude to define Euclidean distance (i.e l_{2}-norm). We may scale the distances, and 0.2 in the line `z = scaleNorm 0.2 u;`

means that we conceal changes in location up to 1 / 0.2 = 5 units. Finally, `return linf z;`

shows how the distance between the tables is computed from the distances between their rows, and `linf`

means that we take the maximum row distance (i.e l_{∞}-norm), so differential privacy should cover the case where the locations of *all* sensitive rows change by 5 units.

In pure derivative sensitivity, we considered differential privacy w.r.t. change in some particular cells of the data tables. The number of rows was considered immutable. Instead of (or in addition to) tracking changes in certain cells, we may also use traditional definition of DP, considering addition and removal of a row as a unit change. For this, we need to define the cost of adding or removing a row, defined by a variable G. It assumes that all rows are sensitive, so we need to mark `rows: all ;`

to make it work properly.

rows: all ; cols: none ; G: 1 ;

It is possible to combine these two distances into one.

rows: all ; cols: latitude longitude ; G: 1.0 ;

Intuitively, this means that both types of changes are allowed. In this example, differential privacy conceals the facts that a row has been added or removed, as well as that the latitude or longitude has been changed by a unit. More precisely, we define the distance between two tables as a *table edit distance* (analogous to string edit distance) that uses the following operations:

- the cost of row insertion/deletion (defined by the line
`G:`

). - the cost of cell modification (defined by the line
`cols:`

and the possible extension).

Table edit distance is defined as the minimal cost of operations required to transform one table into the other.

In the end, the sensitivity analyzer executed from command line reports the following:

- How the noise should be sampled once the noise magnitude is known (this is a constant message that does not depend on the analysis result).

Cauchy noise distribution: add noise 'Cauchy noise'*z, where z ~ sqrt(2) / pi * 1 / (1 + |x|^4) Laplace noise distribution: add noise 'Laplace noise'*z, where z ~ 1 / 2 * exp(-x)

- The noise magnitudes and various other parameters for particular input tables.

Task: output - sensitivity query_result cauchy_noise cauchy_relative_err beta delta laplace_noise laplace_relative_err norm all input tables together 0.166667 3.66 2.29 62.694603 0.100000 9.08e-5 Infinity Infinity an l_1-norm of all input table norms ship 0.166667 3.66 2.29 62.694603 0.100000 9.08e-5 Infinity Infinity || 0.2* (|| "latitude","longitude" ||_2.0),0.1* (|| "cargo" ||_1.0) ||_1.0, rows l_inf

where:

`sensitivity`

is the ß-smooth sensitivity w.r.t. specified norm`query_result`

is the true query output(s) without noise`cauchy_noise`

is the additive error for Cauchy DP mechanism`cauchy_relative_err`

is the relative error for Cauchy DP mechanism`beta`

is the smoothness ß`delta`

is the δ of (ε,δ)-DP, which we need for Laplace DP mechanism (depends on ε and ß)`laplace_noise`

is the additive error for Laplace DP mechanism`laplace_relative_err`

is the relative error for Laplace DP mechanism`norm`

is the norm w.r.t. which DP is achieved with given parameters

Relative error is defined as the quotient of the noise (additive error) and the true query output. If there are several outputs, then the relative error is the quotient of l_{2}-norms of the additive error vector and the output vector.

Assume that we are running the following query from the analyzer command line

dist/build/banach/banach -QDsp --db-create-tables demo_schema.sql demo_query.sql demo_constraints.att --policy=demo_attacker_goal.sql --epsilon 0.3

Based on this example, we describe those files and the numeric inputs that are different from the sensitivity analysis described in the previous section.

Attacker goal is specified in the file given as an optional parameter `–policy`

(in our example `demo_attacker_goal.att`

), or in the window tab that opens on clicking *Attacker goal* under *Analysis settings* window if the analysis is run from PLEAK web application.

Attacker goal is given in form of SQL query with extended syntax. It defines a set of sensitive components, which the attacker is trying to guess. For each sensitive attribute, the guess can either be `exact`

(discrete attributes), or `approx r`

(approximated by r > 0 units). It is possible to combine several attributes into a vector and define approximation w.r.t. some l_p-norm as `approxWrtLp(p)`

and `approxWrtLinf`

. The filter `WHERE condition`

describes which rows of the considered tables are treated as sensitive. The delimiter `;`

finishes the description of the attacker goal.

SELECT (t.x, t.y) approxWrt(5) AND t.z exact FROM t WHERE b;

In this example, the attacker wins iff he guesses *both* `t.z`

exactly and `(t.x,t.y)`

within 5-unit precision w.r.t. l_2-norm of any row of the table `t`

where `t.b`

holds. The definition of “unit” depends on the data table, e.g. if the location was defined in miles, then a unit is also a mile.

If we want to express that the attacker wins if he guesses *either* `t.z`

or `(t.x,t.y)`

, we replace AND operation with OR.

Compared to combined sensitivity analyser, this tab has more use in guessing advantage analysis, allowing more types of constraints. The keywords `total`

, `set`

and `range`

do not specify any probability distribution on the data, and the analyser assumes worst-case distribution by default (i.e. one for which the advantage is the largest). The keywords `totalUnif`

, `setunif`

, `rangeUnif`

in addition specify that the distribution is uniform. To provide even more distribution details, `setPrior`

allows to define a probability for each element, and `rangePrior`

allows to split a range into n blocks, defining a different weight to each block (inside each block, the elements are distributed uniformly). The analyser also supports `normal`

distribution.

table.attr exact; --attacker knows the exact value table.attr total int; --there are n possible values table.attr set v1 ... vn; --there are values {v1 ... vn} table.attr range lb ub; --the values come from range [lb,ub) table.attr totalUnif int; -- there are n uniformly distributed values table.attr setUnif v1 ... vn; -- uniformly distributed in {v1 ... vn} table.attr rangeUnif lb ub; -- uniformly distributed in [lb,ub) -- value vk comes with probability pk table.attr setPrior (v1, p1) ... (vn, pn) -- range [v(k-1)...vk) comes with prob.pk -- the values within [v(k-1)...vk) are distributed uniformly table.attr rangePrior v0 (v1, p1) ... (vn, pn) -- normal distribution with predefined mean and variance -- we have μ=mean, σ=sqrt(variance) table.attr normal mean variance

Differently from ε of differential privacy as in sensitivity analysis, the parameter `–epsilon`

now defines the desired upper bound on *attacker's advantage* in guessing / approximating the values specified in `demo_attacker_goal.att`

, using the knowledge defined in `demo_constraints.att`

. This quantity is defined by the slider *Attacker advantage* if the analysis is run from PLEAK web application.

In the end, the guessing advantage analyzer executed from command line reports the following:

The true query output(s) without noise.

actual outputs y: [3.655]

The additive and the relative errors. The number `78%`

is the confidence level, stated in the parameter `–errorUB`

, which is `0.78`

by default.

78%-noise magnitude a: [114.226] 78%-realtive error |a|/|y|: 3124.968%

How the noise should actually be sampled.

Cauchy (default) distribution: add noise a*z, where z ~ sqrt(2) / pi * 1 / (1 + |x|^4)

Particular prior and posterior probabilities (worst instance among the analyzed table rows).

prior (worst instance): 35.0% posterior (worst instance): 65.0%

The parameters for conversion between differential privacy and guessing advantage.

DP epsilon: 7.0e-3 smoothness beta: 0.0 (epsilon,delta) for Laplace: (7.0e-3,0.0)

What is the underlying norm w.r.t which DP should be ensured with the previous parameters.

norm N: || 0.2* (|| "ship.longitude","ship.latitude" ||_2.0) ||_inf beta-smooth sensitivity w.r.t. N: [0.167]

Noise level estimate for Laplace mechanism.

78%-noise magnitude a (Laplace): [34.614] 78%-realtive error |a|/|y| (Laplace): 946.96% Laplace noise distribution: add noise a*z, where z ~ 1 / 2 * exp(-x)

sql-derivative-sensitivity-analyser_advanced.txt · Last modified: 2019/12/13 16:54 by alisa