====== Advanced settings ======
  
On this page, we describe how to specify the different parameters and inputs of the SQL combined sensitivity analyzer. We do not reproduce the formal semantics here, but give some examples instead.
  
==== Sensitivity analysis ====
  
Assume that we run the analyzer from the command line as follows.
<code>
dist/build/banach/banach -QDs --db-create-tables demo_schema.sql demo_query.sql demo_constraints.att --epsilon 1.0 --beta 0.1 --errorUB 0.9 --delimiter ' '
</code>
Based on this example, we describe the files and the numeric inputs used by the analyzer.
  
=== Schemas ===
  
The first compulsory parameter specifies the file containing the schema, which is ''demo_schema.sql'' in the example. This file contains the contents of all the **Table schema** and **Output table schema** windows if the analysis is run from the PLEAK web application.

This file lists the schemas of all input tables and, optionally, also of the output tables.
<code>
CREATE TABLE t1 (var_11 type_11, ... var_1n type_1n);
...
CREATE TABLE tm (var_m1 type_m1, ... var_mn type_mn);
</code>
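For illustration, a hypothetical ''demo_schema.sql'' for the ship example used later on this page (the attribute names come from the norm examples below; the types and the ''name'' column are our own assumptions) could look as follows.
<code>
CREATE TABLE ship (name text, latitude float8, longitude float8, cargo float8);
</code>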
=== Queries ===

The second compulsory parameter specifies the file containing the SQL workflow, represented by a list of sequential queries, which is ''demo_query.sql'' in the example. This file contains the contents of all the **Output table query** windows if the analysis is run from the PLEAK web application.

The intermediate queries are plain SELECT queries. If the analyzer is executed from the command line, then INSERT INTO syntax is used to specify the name of the corresponding output table; this is not needed if the analysis is run from the PLEAK web application.
<code>
INSERT INTO output_table
SELECT
    alias_1.attr_1,
    ...
    alias_n.attr_k
FROM
    input_table_1 AS alias_1,
    ...
    input_table_n AS alias_n
WHERE
    condition_1 AND
    ...
    condition_n
;
</code>

The last query should be an aggregation. Grouping by a table attribute (but not by a complex expression) is allowed. The analyzer expects that the group columns come first and the aggregation comes last. Currently supported aggregations are COUNT, SUM, MIN, MAX, and only one aggregation per task is allowed. In general, a query may have the following syntax.
<code>
SELECT
    table_i1.attr_1 AS group_1,
    ...
    table_ik.attr_k AS group_k,
    query_expression
FROM
    table_1 AS alias_1,
    ...
    table_n AS alias_n
WHERE
    condition_1 AND
    ...
    condition_n
GROUP BY
    table_i1.attr_1,
    ...
    table_ik.attr_k
;
</code>
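For illustration, a final aggregation over the hypothetical ''ship'' table sketched above (the filtering condition is made up for the example) could be written as follows.
<code>
SELECT sum(ship.cargo)
FROM ship
WHERE ship.latitude <= 60
;
</code>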

Internally, a group query is converted into several instances of the same query, each equipped with a filter w.r.t. the corresponding group. Hence, the analyzer needs to know the set of possible group values to generate all these queries. If there are too many groups, the analysis will be slower and may result in a high noise level. The groups can be specified in //Table constraints// as discussed in the next section.

=== Table constraints ===

The third compulsory parameter specifies the file containing constraints on table attributes, which is ''demo_constraints.att'' in the example. In the PLEAK web application, this file can be opened by clicking the sub-button //Table constraints// in the //Analysis settings// window.

In this file, the user can define known bounds on the data, which may reduce the noise or even turn an infeasible result into a feasible one. These settings are essential for GROUP BY queries, as they define the total number of possible groups. One can define sets and/or ranges of values.

<code>
table_1.attr_1 range lower_bound upper_bound ;
...
table_n.attr_n set value_1 ... value_n ;
</code>
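For illustration, a hypothetical constraints file for the ''ship'' table (the bounds are our own assumptions) could state that the coordinates are geographic and that the cargo lies in a known range.
<code>
ship.latitude range -90 90 ;
ship.longitude range -180 180 ;
ship.cargo range 0 100 ;
</code>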

=== Epsilon ===

The parameter ε (''--epsilon'') determines the level of differential privacy. Simply put, smaller ε means more privacy. The value must satisfy ε > 0, and there is no general upper bound on ε.

=== Smoothness beta ===

The parameter β (''--beta'') is an optimization parameter that does not affect privacy at all, but is related to the noise level. In general, smaller β means less noise, but not all values of β will work for a particular query, and an unsuitable value results in an error. The user may choose not to fix β and let the analyzer find a suitable β itself (which depends on ε). The optimization takes more time, but the obtained β can be reused afterwards, and it remains reasonable as long as the query or the data does not change too much. For the particular noise generation mechanism that the sensitivity analyzer is using, we have 0 ≤ β < ε / 5.
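For example, to let the analyzer search for a suitable β itself, one can presumably just drop the ''--beta'' flag from the example call, keeping the rest unchanged.
<code>
dist/build/banach/banach -QDs --db-create-tables demo_schema.sql demo_query.sql demo_constraints.att --epsilon 1.0 --errorUB 0.9 --delimiter ' '
</code>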

=== Confidence in reported error ===

The analyzer computes the amount of noise required to achieve ε-differential privacy for the given query. As the noise is drawn from a Laplace or Cauchy distribution, there is no absolute upper bound on the additive error. However, we can define a //probabilistic upper bound// under which the error stays with a certain probability, specified by the parameter ''--errorUB''. If the analysis is run from the PLEAK web application, this value is specified by the slider //Confidence level of estimated noise// in the //Analysis settings// window.

=== Defining data of the input tables ===

If the analysis is run from the PLEAK web application, the data is simply inserted into the provided table of the window tab **table data** that opens after clicking on a data object. If the analysis is run from the command line, then a table T's data is defined in the text file ''T.db''. The parameter ''--delimiter'' specifies the character used to separate the values, similarly to the ''.csv'' format. In our example, the delimiter is a whitespace '' '', which is the default delimiter. In both cases, the first row contains the column names of the corresponding attributes.
<code>
var_1 var_2 var_3
x11 x21 x31
x12 x22 x32
x13 x23 x33
</code>
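For illustration, a hypothetical ''ship.db'' matching the schema sketched above (all values are made up) could look as follows, using the whitespace delimiter.
<code>
name latitude longitude cargo
alpha 58.4 24.7 20.0
beta 59.3 27.4 35.5
gamma 57.8 26.0 12.0
</code>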
=== Defining norms of the input tables ===

A table T's norm is defined in the text file ''T.nrm'' if the analysis is run from the command line, or in the text input window **table norm** that opens after clicking on a data object if the analysis is run from the PLEAK web application.

The first two lines tell which rows (indexed starting from 0) and which columns (identified by the corresponding attribute name) are treated as sensitive. That is, we define differential privacy w.r.t. changes in particular table cells.
<code>
rows: 0 3 7 ;
cols: latitude longitude ;
</code>
In this example, we assume that the columns ''latitude'' and ''longitude'' are sensitive in the rows indexed by 0, 3, and 7. It is possible to define more sophisticated sensitive components. For this, the norm description is extended by a sequence of variable assignments, denoting how the norm is computed from the attributes of the input tables. Supported operations are scaling ''scaleNorm a'' and l<sub>p</sub>-norms ''lp p'' and ''linf'', where a > 0 and p > 1 are constants.
  
<code>
rows: 0 3 7 ;
cols: latitude longitude ;
u = lp 2.0 latitude longitude;
z = scaleNorm 0.2 u;
return linf z;
</code>
  
The line ''u = lp 2.0 latitude longitude;'' combines latitude and longitude to define Euclidean distance (i.e. the l<sub>2</sub>-norm). We may scale the distances, and 0.2 in the line ''z = scaleNorm 0.2 u;'' means that we conceal changes in location up to 1 / 0.2 = 5 units. Finally, ''return linf z;'' shows how the distance between the tables is computed from the distances between their rows, and ''linf'' means that we take the maximum row distance (i.e. the l<sub>∞</sub>-norm), so differential privacy should cover the case where the locations of //all// sensitive rows change by 5 units.
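As a slightly larger illustration, the norm reported in the analysis output below also scales the ''cargo'' attribute by 0.1 and combines it with the scaled location by an l<sub>1</sub>-norm. Assuming that norm variables can be composed freely as described above, such a norm file could be written roughly as follows.
<code>
rows: 0 3 7 ;
cols: latitude longitude cargo ;
u = lp 2.0 latitude longitude;
z = scaleNorm 0.2 u;
v = scaleNorm 0.1 cargo;
w = lp 1.0 z v;
return linf w;
</code>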
  
=== Local and combined sensitivity ===
  
In pure derivative sensitivity, we considered differential privacy w.r.t. changes in some particular cells of the data tables; the number of rows was considered immutable. Instead of (or in addition to) tracking changes in certain cells, we may also use the traditional definition of DP, which considers addition and removal of a row as a unit change. For this, we need to define the cost of adding or removing a row, given by a variable G. This mode assumes that all rows are sensitive, so we need to mark ''rows: all ;'' to make it work properly.
  
<code>
rows: all ;
cols: latitude longitude ;
G: 1.0 ;
u = lp 2.0 latitude longitude;
return linf u;
</code>
  
Intuitively, this means that both types of changes are allowed. In this example, differential privacy conceals both the fact that a row has been added or removed and the fact that the latitude or longitude has been changed by a unit. More precisely, we define the distance between two tables as a //table edit distance// (analogous to string edit distance) that uses the following operations:
  * the cost of row insertion/deletion (defined by the line ''G:'').
  * the cost of cell modification (defined by the line ''cols:'' and its possible extension).

Table edit distance is defined as the minimal cost of the operations required to transform one table into the other.
  
=== Analysis output ===

In the end, the sensitivity analyzer executed from the command line reports the following:
  * How the noise should be sampled once the noise magnitude is known (this is a constant message that does not depend on the analysis result).
<code>
Cauchy noise distribution:  add noise 'Cauchy noise'*z, where z ~ sqrt(2) / pi * 1 / (1 + |x|^4)
Laplace noise distribution: add noise 'Laplace noise'*z, where z ~ 1 / 2 * exp(-x)
</code>
  * The noise magnitudes and various other parameters for particular input tables.
<code>
Task: output
-                             sensitivity     query_result     cauchy_noise     cauchy_relative_err     beta         delta       laplace_noise     laplace_relative_err     norm
all input tables together    0.166667        3.66             2.29             62.694603               0.100000     9.08e-5     Infinity          Infinity                 an l_1-norm of all input table norms
ship                         0.166667        3.66             2.29             62.694603               0.100000     9.08e-5     Infinity          Infinity                 || 0.2* (|| "latitude","longitude" ||_2.0),0.1* (|| "cargo" ||_1.0) ||_1.0, rows l_inf
</code>
where:
  * ''sensitivity'' is the β-smooth sensitivity w.r.t. the specified norm
  * ''query_result'' is the true query output(s) without noise
  * ''cauchy_noise'' is the additive error for the Cauchy DP mechanism
  * ''cauchy_relative_err'' is the relative error for the Cauchy DP mechanism
  * ''beta'' is the smoothness β
  * ''delta'' is the δ of (ε,δ)-DP, which we need for the Laplace DP mechanism (it depends on ε and β)
  * ''laplace_noise'' is the additive error for the Laplace DP mechanism
  * ''laplace_relative_err'' is the relative error for the Laplace DP mechanism
  * ''norm'' is the norm w.r.t. which DP is achieved with the given parameters
  
Relative error is defined as the quotient of the noise (additive error) and the true query output. If there are several outputs, then the relative error is the quotient of the l<sub>2</sub>-norms of the additive error vector and the output vector.
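For instance, in the output table above, the Cauchy relative error of roughly 62.7 is (up to rounding of the printed values) the quotient 2.29 / 3.66, expressed as a percentage.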
==== Guessing advantage analyzer ====
  
Assume that we run the analyzer from the command line as follows.
<code>
dist/build/banach/banach -QDsp --db-create-tables demo_schema.sql demo_query.sql demo_constraints.att --policy=demo_attacker_goal.att --epsilon 0.3
</code>
Based on this example, we describe the files and numeric inputs that differ from the sensitivity analysis described in the previous section.

=== Attacker goal ===
The attacker goal is specified in the file given by the optional parameter ''--policy'' (''demo_attacker_goal.att'' in our example), or in the window tab that opens on clicking //Attacker goal// in the //Analysis settings// window if the analysis is run from the PLEAK web application.
  
The attacker goal is given in the form of an SQL query with extended syntax. It defines the set of sensitive components that the attacker is trying to guess. For each sensitive attribute, the guess can either be ''exact'' (for discrete attributes) or ''approx r'' (approximated to within r > 0 units). It is possible to combine several attributes into a vector and define approximation w.r.t. some l_p-norm as ''approxWrtLp(p)'' and ''approxWrtLinf''. The filter ''WHERE condition'' describes which rows of the considered tables are treated as sensitive. The delimiter '';'' finishes the description of the attacker goal.
  
<code>
SELECT
(t.x, t.y) approxWrt(5) AND
t.z exact
FROM t
WHERE b;
</code>
In this example, the attacker wins iff he guesses //both// ''t.z'' exactly and ''(t.x,t.y)'' within 5-unit precision w.r.t. the l_2-norm, for any row of the table ''t'' where ''t.b'' holds. The definition of a "unit" depends on the data table, e.g. if the location is given in miles, then a unit is also a mile.
  
If we want to express that the attacker wins if he guesses //either// ''t.z'' or ''(t.x,t.y)'', we replace the AND operation with OR, as sketched below.
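A minimal sketch of that variant, for the same hypothetical table ''t'' as above:
<code>
SELECT
(t.x, t.y) approxWrt(5) OR
t.z exact
FROM t
WHERE b;
</code>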
  
=== Table constraints ===

Compared to the [[sql-derivative-sensitivity-analyser|combined sensitivity analyser]], this tab has more use in [[sql-guessing-advantage-analyser|guessing advantage analysis]], allowing more types of constraints. The keywords ''total'', ''set'' and ''range'' do not specify any probability distribution on the data, and the analyser assumes the worst-case distribution by default (i.e. the one for which the advantage is the largest). The keywords ''totalUnif'', ''setUnif'', ''rangeUnif'' in addition specify that the distribution is uniform. To provide even more distribution details, ''setPrior'' allows defining a probability for each element, and ''rangePrior'' allows splitting a range into n blocks, assigning a different weight to each block (inside each block, the elements are distributed uniformly). The analyser also supports the ''normal'' distribution.

<code>
table.attr exact;               --attacker knows the exact value

table.attr total int;           --there are n possible values
table.attr set v1 ... vn;       --there are values {v1 ... vn}
table.attr range lb ub;         --the values come from range [lb,ub)

table.attr totalUnif int;       -- there are n uniformly distributed values
table.attr setUnif v1 ... vn;   -- uniformly distributed in {v1 ... vn}
table.attr rangeUnif lb ub;     -- uniformly distributed in [lb,ub)

-- value vk comes with probability pk
table.attr setPrior (v1, p1) ... (vn, pn)

-- range [v(k-1)...vk) comes with prob. pk
-- the values within [v(k-1)...vk) are distributed uniformly
table.attr rangePrior v0 (v1, p1) ... (vn, pn)

-- normal distribution with predefined mean and variance
-- we have μ=mean, σ=sqrt(variance)
table.attr normal mean variance
</code>

=== Epsilon ===

Differently from the ε of differential privacy used in the sensitivity analysis, the parameter ''--epsilon'' now defines the desired upper bound on the //attacker's advantage// in guessing / approximating the values specified in ''demo_attacker_goal.att'', using the knowledge defined in ''demo_constraints.att''. This quantity is set by the slider //Attacker advantage// if the analysis is run from the PLEAK web application.
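In the running example, ''--epsilon 0.3'' thus asks the analyzer to add enough noise so that the attacker's advantage stays below 0.3, i.e. the posterior guessing probability may exceed the prior by at most 30 percentage points (compare the prior of 35% and the posterior of 65% reported in the output below).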

=== Analysis output ===
In the end, the guessing advantage analyzer executed from the command line reports the following.

The true query output(s) without noise:
<code>
actual outputs y: [3.655]
</code>

The additive and the relative errors. The number ''78%'' is the confidence level, given by the parameter ''--errorUB'', which is ''0.78'' by default.
<code>
78%-noise magnitude a: [114.226]
78%-realtive error |a|/|y|: 3124.968%
</code>

How the noise should actually be sampled:
<code>
Cauchy (default) distribution: add noise a*z, where z ~ sqrt(2) / pi * 1 / (1 + |x|^4)
</code>

Particular prior and posterior probabilities (worst instance among the analyzed table rows):
<code>
prior (worst instance): 35.0%
posterior (worst instance): 65.0%
</code>

The parameters for the conversion between differential privacy and guessing advantage:
<code>
DP epsilon: 7.0e-3
smoothness beta: 0.0
(epsilon,delta) for Laplace: (7.0e-3,0.0)
</code>

The underlying norm w.r.t. which DP should be ensured with the previous parameters:
<code>
norm N: || 0.2* (|| "ship.longitude","ship.latitude" ||_2.0) ||_inf
beta-smooth sensitivity w.r.t. N: [0.167]
</code>

The noise level estimate for the Laplace mechanism:
<code>
78%-noise magnitude a (Laplace): [34.614]
78%-realtive error |a|/|y| (Laplace): 946.96%
Laplace noise distribution: add noise a*z, where z ~ 1 / 2 * exp(-x)
</code>