I. INTRODUCTION
Environmental hazards of natural origin that affect large extensions of land, such as earthquakes, tsunamis, volcanic eruptions, landslides and forest fires, are common in countries like Chile. They produce emergency scenarios where roads are often saturated or damaged and power supplies are down, disrupting connectivity. These hazards can easily affect a large number of people and isolate them from their surrounding environment. While information and storage capabilities are becoming virtually limitless, in such situations, access to the right information at the right time by the right organization is a crucial requirement for making proper decisions and for publishing highly relevant information to the affected communities and to the helpers in charge of handling the emergency [1]. Decision makers usually require access to highly accurate information servers and data applications to estimate the number of affected citizens in a given region and the best available ways to support them.
Environmental monitoring is an area of growing public concern. The advance of cloud computing and the Internet of Things has reshaped the way sensed information is managed and accessed, and advances in sensor technologies have accelerated the emergence of environmental sensing services. These new services rely on new techniques to understand the complexities and relations in the collected sensed information. In particular, they utilize portable sensing devices to extend the sensing range, and cloud-computing environments to analyse, in a productive form, the large amounts of data collected by various Internet of Multimedia Things (IoMT) networks. Various kinds of sensors are deployed in the environment as the physical foundation of most environmental sensing services. It is highly desirable to link the sensed data with external data collected from different services in order to increase the accuracy of the predictions [2]. In regions with environmental hazards, a large number of citizens make intensive observations about these regions using their mobile phones during their daily activities. This massive data is expected to be generated from different sources and published on various IoMT context-based services such as Facebook®, Waze® and Foursquare®. In such situations, it is beneficial to include such data in the decision-making process of environmental monitoring services. In this context, data mashup services appear as a promising tool to accumulate this data and manage it in an appropriate way.
Data mashup [3] is a web technology that combines information from multiple sources into a single web application for a specific task or request. Mashup technology was first introduced in [4], and since then it has opened a new horizon for service providers to integrate their data and deliver highly customizable services to their customers [3]. Data mashup can be used to merge datasets from external IoMT context-based services to leverage the monitoring service from different perspectives, such as providing more precise predictions and better performance, and alleviating cold-start problems [5] for new environmental monitoring services. For this reason, providers of next-generation environmental monitoring services are keen to adopt accurate data mashup services for their systems. However, privacy is an essential concern for the application of mashup in IoMT-enabled environmental monitoring, as the generated insights obviously require integrating citizens' behavioural and neighbouring environment data from multiple IoMT context-based services. This might reveal private citizens' behaviours that were not observable before the data mashup. A serious privacy breach can occur if the same citizen is registered on multiple sites, since adversaries can try to deanonymize the citizen's identity by correlating the information contained in the mashed-up data with information obtained from external public databases. These breaches prevent IoMT context-based services from revealing raw behavioural data of citizens to each other or to the mashup service. Moreover, divulging citizens' data represents an infringement of personal privacy laws that may apply in countries where these sites operate. As a result, if citizens learn that their raw data is revealed to other parties, they will lose trust in the service. According to the survey results in [6, 7], users might leave a service provider because of privacy concerns.
In this work, we propose a fog-based middleware for private data mashup (FMPM) that addresses the privacy issues related to mashing up multiple datasets from IoMT context-based services for environmental monitoring purposes. We focus on the stages related to dataset collection and processing and omit all aspects related to environmental monitoring itself, mainly because these stages are critical with regard to privacy as they involve different entities. We present two concealment algorithms to protect citizens' privacy while preserving the aggregates in the mashed-up datasets, in order to maximize usability and attain accurate insights. Using these algorithms, each party involved in the mashup is given complete control over the privacy of its dataset. In the rest of this paper, we generically refer to behavioural and neighbouring environment data as items. Section II describes related work. Section III introduces the IoMT-enabled data mashup network scenario hosting our FMPM. Section IV introduces the proposed concealment algorithms used in our FMPM. Section V describes experiments and results based on the concealment algorithms for IoMT context-based services. Finally, Section VI presents conclusions and future work.
II. RELATED WORK
The majority of the literature addresses the problem of privacy in third-party services [8-13], since they are a potential source of leakage of personally identifiable information. However, few works have studied privacy for mashup services. The work in [3] discusses a private data mashup system, where the authors formalize the problem as achieving k-anonymity on the integrated data without revealing detailed information about this process or disclosing data from one party to another. The work in [14] proposes a theoretical framework to preserve the privacy of customers and the commercial interests of merchants. Their system is a hybrid recommender that uses secure two-party protocols with a public key infrastructure to achieve the desired goals. The works in [15, 16] suggest another method for privacy preservation in centralized services: adding uncertainty to the data using a randomized perturbation technique while attempting to ensure that the necessary statistical aggregates are not significantly disturbed. Hence, the server has no knowledge of the true values of individual data for each user. They demonstrate that this method does not essentially decrease the accuracy of the results. However, recent research [17, 18] has pointed out that these techniques do not provide the level of privacy previously thought. The work in [18] points out that arbitrary randomization is not safe because it is easy to breach the privacy protection it offers. The authors propose a random-matrix-based spectral filtering technique to recover the original data from the perturbed data. Their experiments revealed that in many cases random perturbation techniques preserve very little privacy.
III. DATA MASHUP IN AN IOT-ENABLED ENVIRONMENTAL MONITORING SCENARIO
We consider the scenario where the IoMT-enabled data mashup service (IoMT-enabled DMS) integrates datasets from multiple IoMT context-based services for IoT-enabled environmental monitoring; Figures 1 and 2 illustrate the scenario used in this work. We assume all involved parties follow the semi-honest model, which is a realistic assumption because each party needs to accomplish its business goals and increase its revenue. We also assume all parties involved in the data mashup have similar item sets (activity catalogues), but their user sets are not identical. Each IoMT context-based service has its own ETL (Extract, Transform, Load) service that is able to learn the behavioural and neighbouring environment data of citizens.
The data mashup process based on FMPM can be summarized as follows. The environmental cognition service sends a query to the IoMT-enabled DMS to gather information related to the behavioural and neighbouring environment data of citizens in a specific region, in order to leverage its predictions and performance. The coordinator agent in the IoMT-enabled DMS looks up its providers' cache to determine which providers could satisfy that query, then transforms the query of the environmental cognition service into sub-queries in languages suitable for each provider's database. The manager agent unit sends each sub-query to the candidate providers to invite them to the data mashup process. A provider that decides to participate forwards the sub-query to its manager agent to refine it according to its privacy preferences. This step allows the manager agent to audit all issued sub-queries and block ones that could extract sensitive information. The resulting dataset is sent to the local concealment agent (LOA), which hides real participants' data using the appropriate concealment algorithm. Then, the synchronization agents at each provider, along with the coordinator agent, engage in a distributed join process to identify frequent and partially frequent items in each dataset, and send the joined results to the coordinator. The coordinator agent builds a virtualized schema for the datasets and submits it to each provider involved in the mashup process. Based on this virtualized schema, the providers instruct their global concealment agent (GOA) to run the appropriate concealment algorithm on the locally concealed datasets. Finally, the providers submit all the resulting datasets to the IoMT-enabled DMS, which in turn merges these results and delivers them to the environmental cognition service. The environmental cognition service uses these datasets to accomplish the required data analytics goals. We use anonymous pseudonym identities to alleviate provider identity problems, as the database providers do not want to reveal their ownership of the data to competing providers; moreover, the IoMT-enabled DMS is keen to hide the identities of providers as a business asset.
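To make this flow concrete, the following C++ sketch models it at the interface level. All type and function names (Provider, CoordinatorAgent, runMashup and so on) are our own illustrative placeholders rather than the actual FMPM API, and the bodies are stubs: the sketch mirrors only the sequence of steps described above, not the agents' internal logic.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Placeholder types; the real middleware entities are richer than this.
struct Query   { std::string text; };
struct Dataset { std::vector<std::string> records; };

struct Provider {
    Query   refine(Query q)              { return q; }  // manager agent audits the sub-query
    Dataset concealLocally(const Query&) { return {}; } // LOA applies local concealment (CBO)
    Dataset concealGlobally(Dataset d)   { return d; }  // GOA applies global concealment (RRG)
};

struct CoordinatorAgent {
    Query toSubQuery(const Query& q, const Provider&) { return q; }
    void  buildVirtualizedSchema(const std::vector<Provider>&) {}
};

// One mashup round: sub-queries, local concealment, schema construction,
// global concealment, and the merge delivered to the cognition service.
// The distributed join of frequent items is elided.
Dataset runMashup(CoordinatorAgent& coord, std::vector<Provider>& providers,
                  const Query& query) {
    std::vector<Dataset> locals;
    for (auto& p : providers)
        locals.push_back(p.concealLocally(p.refine(coord.toSubQuery(query, p))));
    coord.buildVirtualizedSchema(providers);
    Dataset merged;
    for (std::size_t i = 0; i < providers.size(); ++i) {
        Dataset g = providers[i].concealGlobally(locals[i]);
        merged.records.insert(merged.records.end(),
                              g.records.begin(), g.records.end());
    }
    return merged;
}
```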
IV. PROPOSED CONCEALMENT ALGORITHMS
In the following subsections, we introduce the proposed algorithms used to preserve the privacy of the resulting datasets with minimal loss of accuracy.
A closer look at the attack model proposed in [19] reveals that if the set of behavioural and neighbouring environment data of a certain citizen is fully distinguishable, with respect to some features, from the data of other citizens in the dataset, this citizen can be identified when an attacker correlates the revealed data with data from other publicly accessible databases. Therefore, it is highly desirable that, for every real item released by each participant, the dataset contains at least a minimum number of items with a similar feature vector. A real item in the released dataset can be described by a number of features in a feature vector, such as place of activity, type of activity, duration, time and date. Both implicit and explicit ways can be used to extract this information, construct these feature vectors and maintain them. Additionally, the data sparsity problem associated with ETL services can be used to formulate attacks, as also shown in [19]. Before proceeding, we introduce a few relevant definitions.
Definition 1 (Dissimilarity measure): this metric measures the amount of divergence between two items with respect to their feature vectors. We use the notation Dm(Iu, In) to denote the dissimilarity measure between items Iu and In based on the feature vector of each item. Dm(Iu, In) < δ ⇒ Iu ∼ In (Iu is similar to In), where δ is a user-defined threshold value.
Definition 2 (Affinity group): the set of items that are similar to item Iu with respect to the pth attribute Ap of the feature vector is called the affinity group of Iu and is denoted by CAp(Iu).
Definition 3 (k-Similar item group): Let Dϖ be the real items dataset and D̂ϖ its locally concealed version. We say D̂ϖ satisfies the property of the k-similar item group (where k is a predefined value) provided that, for every item Iu ∈ Dϖ, there exist at least k−1 other distinct fake items In1, …, In(k−1) ∈ Dn forming an affinity group with Iu, such that Dm(Iu, Inj) < δ for j = 1, …, k−1.
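As an illustration of Definitions 1 and 3, the following C++ sketch checks whether a real item has a k-similar item group among candidate fake items. The paper does not fix a concrete dissimilarity function, so we assume a plain Euclidean distance over numerically encoded feature vectors; all names here are our own.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Feature vector of an item (place, type of activity, duration, time, ...),
// assumed here to be encoded numerically.
using FeatureVector = std::vector<double>;

// Definition 1: dissimilarity measure Dm(Iu, In). We assume Euclidean
// distance between feature vectors as a stand-in for the paper's Dm.
double dissimilarity(const FeatureVector& iu, const FeatureVector& in) {
    double sum = 0.0;
    for (std::size_t p = 0; p < iu.size(); ++p) {
        const double d = iu[p] - in[p];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Definition 3: true when at least k-1 candidate fake items lie within the
// threshold delta of the real item iu, i.e. iu has a k-similar item group.
bool hasKSimilarGroup(const FeatureVector& iu,
                      const std::vector<FeatureVector>& fakeCandidates,
                      double delta, std::size_t k) {
    if (k <= 1) return true;  // trivially satisfied
    std::size_t similar = 0;
    for (const auto& in : fakeCandidates)
        if (dissimilarity(iu, in) < delta && ++similar >= k - 1)
            return true;
    return false;
}
```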
Our motivation for proposing CBO is the limitation of current anonymity models, which fail to provide overall anonymity because they do not consider matching items based on their feature vectors. CBO uses the feature vectors of the current real items to select fake items highly similar to the real items, creating a homogeneous concealed dataset. Using fake transactions to maintain privacy was presented in [3], [20, 21], where the authors considered adding fake transactions to anonymize the original data transactions. This approach has several advantages over other schemes, including that any off-the-shelf data analytics algorithm can be used to analyse the concealed data, and that it can provide a high theoretical privacy guarantee. The locally concealed dataset obtained using CBO should be indistinguishable from the original dataset in order to preserve privacy. The core idea of CBO is to split the dataset into two subsets: the first subset is modified to satisfy the k-similar item group definition, and the other subset is concealed by substituting real items with fake items based on a probabilistic approach. CBO creates a concealed dataset Dp as follows:
1. The sensitive items are suppressed from the dataset based on provider preferences; we then treat the suppressed dataset D as the real dataset.
2. A ϖ percent of the most frequent items in dataset D is selected to form a new subset Dϖ. This step reduces the number of substituted fake items inside the concealed dataset Dp and maintains data quality by preserving the aggregates of highly frequent preferences.
3. CBO builds affinity groups for each real item Iu ∈ Dϖ by adding fake items to form k-similar item groups. We implemented this task as a text categorization problem based on the feature vectors of real items, using a bag-of-words naive Bayesian text classifier [22] extended to handle a vector of bags of words. The task continues until all items in Dϖ belong to different affinity groups, yielding a new dataset D̂ϖ.
4. For each Iu ∈ Du = D − Dϖ, CBO selects the real item Iu from the real item set Du with probability α, or selects a fake item In from the candidate fake item set Dn with probability 1 − α. The selected item Ip is added as a record to the concealed dataset Dp (a code sketch of this step follows the list). This method achieves the desired privacy guarantee because the type of the selected item and α are unknown to external parties. The process continues until all real items in Du have been processed.
5. Finally, the concealed dataset Dp is merged with the subset D̂ϖ obtained in step 3.
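The following C++ sketch illustrates the probabilistic substitution of step 4 under our own simplifying assumptions: a minimal, hypothetical Item record and a non-empty fake candidate set. The affinity-group construction of step 3 is omitted.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical item record; the field layout is our own assumption.
struct Item {
    int  id;
    bool isFake;
};

// Probabilistic substitution (step 4): each record of Dp is the real item
// with probability alpha, or a random candidate fake item with probability
// 1 - alpha. Assumes the fake candidate set dn is non-empty.
std::vector<Item> concealBySubstitution(const std::vector<Item>& du,
                                        const std::vector<Item>& dn,
                                        double alpha) {
    std::mt19937 gen(std::random_device{}());
    std::bernoulli_distribution keepReal(alpha);
    std::uniform_int_distribution<std::size_t> pickFake(0, dn.size() - 1);

    std::vector<Item> dp;
    dp.reserve(du.size());
    for (const Item& real : du) {
        // External parties observe only the record, not the branch or alpha.
        dp.push_back(keepReal(gen) ? real : dn[pickFake(gen)]);
    }
    return dp;
}
```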
In terms of performance, CBO incurs supplementary storage and computation costs. The supplementary storage costs can be reduced by clustering the items in the resulting dataset into C clusters and using the feature vectors of the top-N highest-rated items in each cluster for the CBO algorithm; the supplementary storage cost is then on the order of O(CN). The computation costs of CBO are divided between the complexity of creating affinity groups and that of adding fake items. The overhead of creating affinity groups clearly dominates, and it can be reduced by selecting lower values for ϖ.
After executing CBO, the synchronization agents build a virtualized schema with the aid of the coordinator agent at the IoMT-enabled DMS, and then the global concealment agent starts executing the RRG algorithm. The coordinator agent cannot learn the real items in the merged datasets, as they have already been concealed locally using the CBO algorithm. The main aim of RRG is to alleviate the data sparsity problem by filling the empty cells in such a way as to improve the accuracy of the predictions on the environmental monitoring side and to increase the privacy attained by the providers. The RRG algorithm consists of the following steps (a code sketch follows the list):
1. The global concealment agent finds the number Ir of items that are frequent among the majority of users and the number I − Ir of partially frequent items, where I denotes the total number of items in the merged datasets.
2. The global concealment agent randomly selects an integer ρ between 0 and 100, and then chooses a uniform random number ξ over the range [0, ρ].
3. The global concealment agent selects ξ percent of the partially frequent items in the merged datasets and uses KNN to predict the values of the empty cells for that percentage.
4. The remaining empty cells are filled with random values chosen from a distribution reflecting the frequent items in the merged datasets.
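The C++ sketch below illustrates steps 2-4 on a user-item matrix, under our own assumptions: empty cells are marked with a sentinel value, and the KNN predictor of step 3 is replaced by a simple column-mean stub rather than a real KNN implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Sentinel for empty cells of the merged user-item matrix (an assumption;
// the paper does not specify the matrix encoding).
constexpr double EMPTY = 0.0;

// Stand-in for the KNN predictor of step 3: here simply the mean of the
// filled cells in the item's column.
double predictKNN(const std::vector<std::vector<double>>& m, std::size_t item) {
    double sum = 0.0;
    std::size_t n = 0;
    for (const auto& row : m)
        if (row[item] != EMPTY) { sum += row[item]; ++n; }
    return n > 0 ? sum / n : EMPTY;
}

// Steps 2-4 of RRG: draw rho and xi, KNN-fill xi percent of the partially
// frequent items, then fill the rest from a distribution reflecting the
// frequent items (freqDist indexes into valueLevels).
void rrgFill(std::vector<std::vector<double>>& m,
             std::vector<std::size_t> partialItems,     // indices of the I - Ir items
             std::discrete_distribution<int> freqDist,  // mirrors frequent-item values
             const std::vector<double>& valueLevels) {
    std::mt19937 gen(std::random_device{}());
    int rho = std::uniform_int_distribution<int>(0, 100)(gen);   // step 2
    int xi  = std::uniform_int_distribution<int>(0, rho)(gen);

    std::shuffle(partialItems.begin(), partialItems.end(), gen); // step 3
    std::size_t nKnn = partialItems.size() * xi / 100;
    for (std::size_t k = 0; k < nKnn; ++k)
        for (auto& row : m)
            if (row[partialItems[k]] == EMPTY)
                row[partialItems[k]] = predictKNN(m, partialItems[k]);

    for (auto& row : m)                                          // step 4
        for (auto& cell : row)
            if (cell == EMPTY) cell = valueLevels[freqDist(gen)];
}
```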
The privacy of the merged datasets is maintained because all processing is done on datasets previously concealed using CBO. The global concealment agent improves the overall privacy and accuracy by increasing the density of the merged datasets through the filled cells. With increasing ρ values, RRG reduces the randomness in the frequencies, which may increase the accuracy of the predictions while decreasing the privacy level. Hence, RRG should select ρ so as to achieve the required balance between privacy and accuracy.
V. EXPERIMENTAL RESULTS
The proposed algorithms are implemented in C++; we used the message-passing interface (MPI) for a distributed-memory implementation of the RRG algorithm to mimic a distributed network of nodes. To evaluate the effect of our proposed algorithms on mashed-up datasets, we used a dataset pulled from the SportyPal® network, linked to another dataset containing behavioural and neighbouring environment data of 8000 students at the University of Zagreb, Croatia, collected between 2006 and 2008. For the purposes of this work, we measured two aspects of this dataset: privacy breach level and accuracy of results. We divided the dataset into a training set and a testing set. The training set is concealed and then used as a database for the monitoring service. To evaluate the accuracy of the generated predictions, we used the mean absolute error (MAE) metric proposed in [23]. To measure the privacy breach level, we used mutual information as a measure of the privacy breach of Du through Dp.
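For reference, both metrics can be computed as in the following sketch. The discretization of Du and Dp into an empirical joint distribution for the mutual information estimate is an assumption on our part; the paper does not specify its estimator.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mean absolute error between predicted and true values (accuracy metric).
double mae(const std::vector<double>& predicted,
           const std::vector<double>& actual) {
    double sum = 0.0;
    for (std::size_t i = 0; i < predicted.size(); ++i)
        sum += std::fabs(predicted[i] - actual[i]);
    return sum / static_cast<double>(predicted.size());
}

// Mutual information I(Du; Dp) from an empirical joint distribution over
// discretized item values; joint[x][y] must sum to 1. Lower values indicate
// that Dp reveals less about Du, i.e. a smaller privacy breach.
double mutualInformation(const std::vector<std::vector<double>>& joint) {
    const std::size_t nx = joint.size(), ny = joint[0].size();
    std::vector<double> px(nx, 0.0), py(ny, 0.0);
    for (std::size_t x = 0; x < nx; ++x)
        for (std::size_t y = 0; y < ny; ++y) {
            px[x] += joint[x][y];
            py[y] += joint[x][y];
        }
    double mi = 0.0;
    for (std::size_t x = 0; x < nx; ++x)
        for (std::size_t y = 0; y < ny; ++y)
            if (joint[x][y] > 0.0)
                mi += joint[x][y] * std::log2(joint[x][y] / (px[x] * py[y]));
    return mi;
}
```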
In the first experiment, we measured the relation between the quantity of real items in the concealed dataset and the privacy breach. We selected values in the range from 1.0 to 5.5 and increased the number of real items from 100 to 1000, using a fake item set drawn from a uniform distribution as a baseline. As shown in Figure 3, our generated fake set reduces the privacy breach and performs much better than the uniform fake set. As the number of real items increases, the uniform fake set gets worse as more information is leaked, while our optimal fake set is not affected by this trend.
In the second experiment, we measured the relation between the quantity of fake items in the subset Dϖ and the accuracy of the classification results. We selected a set of real items from our dataset and split it into two subsets, Dϖ and Du. We concealed subset Du with a fixed value of α to obtain the subset Dp, and appended the subset Dϖ with items from either the optimal fake set or the uniform fake set. Thereafter, we gradually increased the percentage of real items in Dϖ selected from our dataset from 0.1 to 0.9. Figure 4 shows MAE values as a function of the concealment rate for the whole concealed dataset Dp. The IoMT context-based service can select a concealment rate based on its privacy preferences; with a higher value for the concealment rate, more accurate predictions can be attained by the monitoring service. Adding items from the optimal fake set has a minor impact on the MAE of the results, without the need to select a higher value for the concealment rate.
VI. CONCLUSION
In this work, we presented our ongoing work on building a fog-based middleware for private data mashup (FMPM) to serve a centralized IoT-enabled environmental monitoring service. We gave a brief overview of the mashup process and of two concealment mechanisms. The experiments show that our approach reduces privacy breaches and attains accurate results. We encountered many challenges in building an IoMT-enabled data mashup service; as a result, we focused on the environmental monitoring service scenario. This allows us to move forward in building an integrated system while studying issues such as dynamic data release at a later stage, and deferring issues such as the virtualized schema and auditing to our future research agenda.