Data Provenance: Some Basic Issues
An English Essay on the Dangers of WeChat
WeChat, a popular Chinese messaging and social media app, has become an integral part of daily life for many people around the world. However, with its widespread use come potential dangers. Here are some points to consider when discussing the risks of using WeChat:

1. Privacy Concerns: WeChat collects a significant amount of user data, which can be a concern for users. Users should be aware of what information they are sharing and with whom.
2. Security Vulnerabilities: Like any digital platform, WeChat is not immune to security threats. Users need to be vigilant about protecting their accounts with strong passwords and enabling two-factor authentication.
3. Misinformation Spread: The platform can be used to spread false information or rumors. Users should verify the credibility of the information they receive and share.
4. Cyberbullying: As with any social media platform, WeChat can be a place where cyberbullying occurs. It is important for users to report any abusive behavior and to educate themselves on how to deal with such situations.
5. Addiction: The constant connectivity that WeChat offers can lead to overuse and addiction, affecting users' mental health and social interactions in the real world.
6. Financial Scams: WeChat has payment features that can be exploited by scammers. Users should be cautious when making transactions and only deal with trusted contacts.
7. Political Censorship: WeChat is known to comply with Chinese government regulations, which can lead to censorship of certain topics or content. This can limit freedom of speech and access to information.
8. Data Localization Laws: In some countries, there are concerns about data localization laws that require companies like Tencent (WeChat's parent company) to store user data within the country's borders, potentially making it more accessible to local authorities.
9. Influence on Youth: The influence of social media, including WeChat, on young people's behavior and values can be significant. Parents and educators should be aware of the content young people are exposed to and engage in conversations about responsible use.
10. Business Risks: For businesses using WeChat for marketing or customer service, there is a risk of negative publicity if the platform is used improperly or if there are issues with the app's functionality.

To mitigate these risks, it is essential for users to stay informed about best practices for online safety, to be critical of the information they encounter, and to maintain a balance between online and offline life. Additionally, understanding the legal and regulatory environment in which WeChat operates can help users navigate potential challenges related to data privacy and security.
Oral Presentation of an English Paper
01 Introduction
02 Section 1: INTRODUCTION
03 Section 2: Our Contribution
04 Section 3: Technical Aspects
05 Section 4
06 Section 5
Introduction
Hello everyone. Today I will be presenting an English paper on the topic of "Blockchain Based Non-repudiable IoT Data Trading: Simpler, Faster, and Cheaper".
In such a system, the data buyers first send a request to the IoT data owners. The buyers also pay for the data to the third party. After payment confirmation from the third party ...
Section 2: Our Contribution
"locked" in the smart contract. After receiving thatrequest, the IoT data owner firstsendssome partial version ofthedata to thedata buyer,denotedasS1 .We note that the partialversionof thedataisnotsufficientfor thedata buyer to fulfill itspurpose
Cloud Computing: Translated Foreign-Language References
(The source document contains the English original together with a Chinese translation.) Original text:

Technical Issues of Forensic Investigations in Cloud Computing Environments
Dominik Birk
Ruhr-University Bochum, Horst Goertz Institute for IT Security, Bochum, Germany

Abstract: Cloud Computing is arguably one of the most discussed information technologies today. It presents many promising technological and economical opportunities. However, many customers remain reluctant to move their business IT infrastructure completely to the cloud. One of their main concerns is Cloud Security and the threat of the unknown. Cloud Service Providers (CSP) encourage this perception by not letting their customers see what is behind their virtual curtain. A seldom discussed, but in this regard highly relevant, open issue is the ability to perform digital investigations. This continues to fuel insecurity on the sides of both providers and customers. Cloud Forensics constitutes a new and disruptive challenge for investigators. Due to the decentralized nature of data processing in the cloud, traditional approaches to evidence collection and recovery are no longer practical. This paper focuses on the technical aspects of digital forensics in distributed cloud environments. We contribute by assessing whether it is possible for the customer of cloud computing services to perform a traditional digital investigation from a technical point of view. Furthermore, we discuss possible solutions and possible new methodologies helping customers to perform such investigations.

I. INTRODUCTION

Although the cloud might appear attractive to small as well as to large companies, it does not come along without its own unique problems. Outsourcing sensitive corporate data into the cloud raises concerns regarding the privacy and security of data. Security policies, companies' main pillar concerning security, cannot be easily deployed into distributed, virtualized cloud environments. This situation is further complicated by the unknown physical location of the company's assets. Normally, if a security incident occurs, the corporate security team wants to be able to perform their own investigation without dependency on third parties. In the cloud, this is not possible anymore: the CSP obtains all the power over the environment and thus controls the sources of evidence. In the best case, a trusted third party acts as a trustee and guarantees for the trustworthiness of the CSP. Furthermore, the implementation of the technical architecture and circumstances within cloud computing environments bias the way an investigation may be processed. In detail, evidence data has to be interpreted by an investigator in a proper manner, which is hardly possible due to the lack of circumstantial information. (We would like to thank the reviewers for the helpful comments and Dennis Heinson, Center for Advanced Security Research Darmstadt - CASED, for the profound discussions regarding the legal aspects of cloud forensics.) For auditors, this situation does not change: questions about who accessed specific data and information cannot be answered by the customers if no corresponding logs are available. With the increasing demand for using the power of the cloud for processing also sensitive information and data, enterprises face the issue of Data and Process Provenance in the cloud [10]. Digital provenance, meaning meta-data that describes the ancestry or history of a digital object, is a crucial feature for forensic investigations.
In combination with a suitable authentication scheme, it provides information about who created and who modified what kind of data in the cloud. These are crucial aspects for digital investigations in distributed environments such as the cloud. Unfortunately, the aspects of forensic investigations in distributed environments have so far been mostly neglected by the research community. Current discussion centers mostly around security, privacy and data protection issues [35], [9], [12]. The impact of forensic investigations on cloud environments was little noticed, albeit mentioned by the authors of [1] in 2009: "[...] to our knowledge, no research has been published on how cloud computing environments affect digital artifacts, and on acquisition logistics and legal issues related to cloud computing environments." This statement is also confirmed by other authors [34], [36], [40], stressing that further research on incident handling, evidence tracking and accountability in cloud environments has to be done. At the same time, massive investments are being made in cloud technology. Combined with the fact that information technology increasingly transcends people's private and professional life, thus mirroring more and more of people's actions, it becomes apparent that evidence gathered from cloud environments will be of high significance to litigation or criminal proceedings in the future. Within this work, we focus the notion of cloud forensics by addressing the technical issues of forensics in all three major cloud service models and consider cross-disciplinary aspects. Moreover, we address the usability of various sources of evidence for investigative purposes and propose potential solutions to the issues from a practical standpoint. This work should be considered as a surveying discussion of an almost unexplored research area.

The paper is organized as follows: we discuss the related work and the fundamental technical background information of digital forensics, cloud computing and the fault model in sections II and III. In section IV, we focus on the technical issues of cloud forensics and discuss the potential sources and nature of digital evidence as well as investigations in XaaS environments, including the cross-disciplinary aspects. We conclude in section V.

II. RELATED WORK

Various works have been published in the field of cloud security and privacy [9], [35], [30], focusing on aspects of protecting data in multi-tenant, virtualized environments. Desired security characteristics for current cloud infrastructures mainly revolve around isolation of multi-tenant platforms [12], security of hypervisors in order to protect virtualized guest systems, and secure network infrastructures [32]. Albeit digital provenance, describing the ancestry of digital objects, still remains a challenging issue for cloud environments, several works have already been published in this field [8], [10], contributing to the issues of cloud forensics. Within this context, cryptographic proofs for verifying data integrity, mainly in cloud storage offers, have been proposed, yet they lack practical implementations [24], [37], [23]. Traditional computer forensics has already well-researched methods for various fields of application [4], [5], [6], [11], [13]. The aspects of forensics in virtual systems have also been addressed by several works [2], [3], [20], including the notion of virtual introspection [25].
In addition, the NIST has already addressed Web Service Forensics [22], which has a huge impact on investigation processes in cloud computing environments. In contrast, the aspects of forensic investigations in cloud environments have mostly been neglected by both the industry and the research community. One of the first papers focusing on this topic was published by Wolthusen [40] after Bebee et al. had already introduced problems within cloud environments [1]. Wolthusen stressed that there is an inherent strong need for interdisciplinary work linking the requirements and concepts of evidence arising from the legal field to what can be feasibly reconstructed and inferred algorithmically or in an exploratory manner. In 2010, Grobauer et al. [36] published a paper discussing the issues of incident response in cloud environments; unfortunately, no specific issues and solutions of cloud forensics were proposed, which will be done within this work.

III. TECHNICAL BACKGROUND

A. Traditional Digital Forensics

The notion of Digital Forensics is widely known as the practice of identifying, extracting and considering evidence from digital media. Unfortunately, digital evidence is both fragile and volatile and therefore requires the attention of special personnel and methods in order to ensure that evidence data can be properly isolated and evaluated. Normally, the process of a digital investigation can be separated into three different steps, each having its own specific purpose:

1) In the Securing Phase, the major intention is the preservation of evidence for analysis. The data has to be collected in a manner that maximizes its integrity. This is normally done by a bitwise copy of the original media. As can be imagined, this represents a huge problem in the field of cloud computing, where you never know exactly where your data is and additionally do not have access to any physical hardware. However, the snapshot technology, discussed in section IV-B3, provides a powerful tool to freeze system states and thus makes digital investigations, at least in IaaS scenarios, theoretically possible.

2) We refer to the Analyzing Phase as the stage in which the data is sifted and combined. It is in this phase that the data from multiple systems or sources is pulled together to create as complete a picture and event reconstruction as possible. Especially in distributed system infrastructures, this means that bits and pieces of data are pulled together for deciphering the real story of what happened and for providing a deeper look into the data.

3) Finally, at the end of the examination and analysis of the data, the results of the previous phases will be reprocessed in the Presentation Phase. The report created in this phase is a compilation of all the documentation and evidence from the analysis stage. The main intention of such a report is that it contains all results and is complete and clear to understand.

Apparently, the success of these three steps strongly depends on the first stage. If it is not possible to secure the complete set of evidence data, no exhaustive analysis will be possible. However, in real-world scenarios often only a subset of the evidence data can be secured by the investigator. In addition, an important definition in the general context of forensics is the notion of a Chain of Custody. This chain clarifies how and where evidence is stored and who takes possession of it. Especially for cases which are brought to court, it is crucial that the chain of custody is preserved.
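To illustrate the integrity requirement of the Securing Phase in code, the sketch below computes a SHA-256 digest of an acquired image so that later phases can demonstrate that the bitwise copy is unchanged. This is a generic illustration, not a procedure prescribed by the paper; the file name and chunk size are arbitrary assumptions.

```python
import hashlib


def image_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a (possibly very large) evidence image.

    Hashing at acquisition time and re-hashing before analysis lets an
    investigator document that the bitwise copy was not altered, which
    supports the chain of custody.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Example: record the digest right after acquisition, verify it again later.
# acquired = image_digest("vm_disk_snapshot.img")   # hypothetical file name
# assert acquired == image_digest("vm_disk_snapshot.img")
```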
B. Cloud Computing

According to the NIST [16], cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal CSP interaction. This raw definition of cloud computing brought several new characteristics such as multi-tenancy, elasticity, pay-as-you-go and reliability. Within this work, the following three models are used: In the Infrastructure as a Service (IaaS) model, the customer is using the virtual machine provided by the CSP for installing his own system on it. The system can be used like any other physical computer with a few limitations. However, the additional customer power over the system comes along with additional security obligations. Platform as a Service (PaaS) offerings provide the capability to deploy application packages created using the virtual development environment supported by the CSP. For the efficiency of the software development process this service model can be a propellant. In the Software as a Service (SaaS) model, the customer makes use of a service run by the CSP on a cloud infrastructure. In most of the cases this service can be accessed through an API for a thin client interface such as a web browser. Closed-source public SaaS offers such as Amazon S3 and GoogleMail can only be used in the public deployment model, leading to further issues concerning security, privacy and the gathering of suitable evidence.

Furthermore, two main deployment models, private and public cloud, have to be distinguished. Common public clouds are made available to the general public. The corresponding infrastructure is owned by one organization acting as a CSP and offering services to its customers. In contrast, the private cloud is exclusively operated for an organization but may not provide the scalability and agility of public offers. The additional notions of community and hybrid cloud are not exclusively covered within this work. However, independently from the specific model used, the movement of applications and data to the cloud comes along with limited control for the customer over the application itself, the data pushed into the applications and also the underlying technical infrastructure.

C. Fault Model

Be it an account for a SaaS application, a development environment (PaaS) or a virtual image of an IaaS environment, systems in the cloud can be affected by inconsistencies. Hence, for both customer and CSP it is crucial to have the ability to assign faults to the causing party, even in the presence of Byzantine behavior [33]. Generally, inconsistencies can be caused by the following two reasons:

1) Maliciously Intended Faults: Internal or external adversaries with specific malicious intentions can cause faults on cloud instances or applications. Economic rivals as well as former employees can be the reason for these faults and constitute a constant threat to customers and the CSP. In this model, a malicious CSP is also included, albeit it is assumed to be rare in real-world scenarios. Additionally, from the technical point of view, the movement of computing power to a virtualized, multi-tenant environment can pose further threats and risks to the systems. One reason for this is that if a single system or service in the cloud is compromised, all other guest systems and even the host system are at risk.
Hence, besides the need for further security measures, precautions for potential forensic investigations have to be taken into consideration.

2) Unintentional Faults: Inconsistencies in technical systems or processes in the cloud do not implicitly have to be caused by malicious intent. Internal communication errors or human failures can lead to issues in the services offered to the customer (i.e. loss or modification of data). Although these failures are not caused intentionally, both the CSP and the customer have a strong intention to discover the reasons and deploy corresponding fixes.

IV. TECHNICAL ISSUES

Digital investigations are about control of forensic evidence data. From the technical standpoint, this data can be available in three different states: at rest, in motion or in execution. Data at rest is represented by allocated disk space. Whether the data is stored in a database or in a specific file format, it allocates disk space. Furthermore, if a file is deleted, the disk space is de-allocated for the operating system but the data is still accessible, since the disk space has not been re-allocated and overwritten. This fact is often exploited by investigators, who explore this de-allocated disk space on hard disks. In case the data is in motion, data is transferred from one entity to another; e.g., a typical file transfer over a network can be seen as a data-in-motion scenario. Several encapsulated protocols contain the data, each leaving specific traces on systems and network devices which can in return be used by investigators. Data can also be loaded into memory and executed as a process. In this case, the data is neither at rest nor in motion but in execution. On the executing system, process information, machine instructions and allocated/de-allocated data can be analyzed by creating a snapshot of the current system state. In the following sections, we point out the potential sources of evidential data in cloud environments and discuss the technical issues of digital investigations in XaaS environments, as well as suggest several solutions to these problems.

A. Sources and Nature of Evidence

Concerning the technical aspects of forensic investigations, the amount of potential evidence available to the investigator strongly diverges between the different cloud service and deployment models. The virtual machine (VM), hosting in most of the cases the server application, provides several pieces of information that could be used by investigators. On the network level, network components can provide information about possible communication channels between the different parties involved. The browser on the client, often acting as the user agent for communicating with the cloud, also contains a lot of information that could be used as evidence in a forensic investigation. Independently of the model used, the following three components can act as sources of potential evidential data.

1) Virtual Cloud Instance: The VM within the cloud, where, for instance, data is stored or processes are handled, contains potential evidence [2], [3]. In most of the cases, it is the place where an incident happened and hence provides a good starting point for a forensic investigation. The VM instance can be accessed by both the CSP and the customer who is running the instance. Furthermore, virtual introspection techniques [25] provide access to the runtime state of the VM via the hypervisor, and snapshot technology supplies a powerful technique for the customer to freeze specific states of the VM.
Therefore, virtual instances can still be running during analysis, which leads to the case of live investigations [41], or can be turned off, leading to static image analysis. In SaaS and PaaS scenarios, the ability to access the virtual instance for gathering evidential information is highly limited or simply not possible.

2) Network Layer: Traditional network forensics is known as the analysis of network traffic logs for tracing events that have occurred in the past. Since the different ISO/OSI network layers provide several pieces of information on protocols and communication between instances within as well as with instances outside the cloud [4], [5], [6], network forensics is theoretically also feasible in cloud environments. However, in practice, ordinary CSP currently do not provide any log data from the network components used by the customer's instances or applications. For instance, in case of a malware infection of an IaaS VM, it will be difficult for the investigator to get any form of routing information and network log data in general, which is crucial for further investigative steps. This situation gets even more complicated in case of PaaS or SaaS. So again, the situation of gathering forensic evidence is strongly affected by the support the investigator receives from the customer and the CSP.

3) Client System: On the system layer of the client, it completely depends on the used model (IaaS, PaaS, SaaS) if and where potential evidence could be extracted. In most of the scenarios, the user agent (e.g. the web browser) on the client system is the only application that communicates with the service in the cloud. This especially holds for SaaS applications which are used and controlled by the web browser. But also in IaaS scenarios, the administration interface is often controlled via the browser. Hence, in an exhaustive forensic investigation, the evidence data gathered from the browser environment [7] should not be omitted.

a) Browser Forensics: Generally, the circumstances leading to an investigation have to be differentiated: in ordinary scenarios, the main goal of an investigation of the web browser is to determine if a user has been the victim of a crime. In complex SaaS scenarios with high client-server interaction, this constitutes a difficult task. Additionally, customers strongly make use of third-party extensions [17] which can be abused for malicious purposes. Hence, the investigator might want to look for malicious extensions, searches performed, websites visited, files downloaded, information entered in forms or stored in local HTML5 stores, web-based email contents and persistent browser cookies for gathering potential evidence data. Within this context, it is inevitable to investigate the appearance of malicious JavaScript [18] leading to, e.g., unintended AJAX requests and hence modified usage of administration interfaces. Generally, the web browser contains a lot of electronic evidence data that could be used to give an answer to both of the above questions, even if the private mode is switched on [19].

B. Investigations in XaaS Environments

Traditional digital forensic methodologies permit investigators to seize equipment and perform detailed analysis on the media and data recovered [11]. In a distributed infrastructure organization like the cloud computing environment, investigators are confronted with an entirely different situation. They no longer have the option of seizing physical data storage.
Data and processes of the customer are dispersed over an undisclosed number of virtual instances, applications and network elements. Hence, it is in question whether preliminary findings of the computer forensic community in the field of digital forensics have to be revised and adapted to the new environment. Within this section, specific issues of investigations in SaaS, PaaS and IaaS environments will be discussed. In addition, cross-disciplinary issues which affect several environments uniformly will be taken into consideration. We also suggest potential solutions to the mentioned problems.

1) SaaS Environments: Especially in the SaaS model, the customer does not obtain any control of the underlying operating infrastructure such as the network, servers, operating systems or the application that is used. This means that no deeper view into the system and its underlying infrastructure is provided to the customer. Only limited user-specific application configuration settings can be controlled, contributing to the evidence which can be extracted from the client (see section IV-A3). In a lot of cases this urges the investigator to rely on high-level logs which are eventually provided by the CSP. Given the case that the CSP does not run any logging application, the customer has no opportunity to create any useful evidence through the installation of any toolkit or logging tool. These circumstances do not allow a valid forensic investigation and lead to the assumption that customers of SaaS offers do not have any chance to analyze potential incidents.

a) Data Provenance: The notion of Digital Provenance is known as meta-data that describes the ancestry or history of digital objects. Secure provenance that records ownership and process history of data objects is vital to the success of data forensics in cloud environments, yet it is still a challenging issue today [8]. Albeit data provenance is of high significance also for IaaS and PaaS, it poses a huge problem specifically for SaaS-based applications: currently, globally acting public SaaS CSP offer Single Sign-On (SSO) access control to the set of their services. Unfortunately, in case of an account compromise, most of the CSP do not offer any possibility for the customer to figure out which data and information has been accessed by the adversary. For the victim, this situation can have a tremendous impact: if sensitive data has been compromised, it is unclear which data has been leaked and which has not been accessed by the adversary. Additionally, data could be modified or deleted by an external adversary or even by the CSP, e.g. due to storage reasons. The customer has no ability to prove otherwise. Secure provenance mechanisms for distributed environments can improve this situation but have not been practically implemented by CSP [10].

Suggested Solution: In private SaaS scenarios this situation is improved by the fact that the customer and the CSP are probably under the same authority. Hence, logging and provenance mechanisms could be implemented which contribute to potential investigations. Additionally, the exact location of the servers and the data is known at any time. Public SaaS CSP should offer additional interfaces for the purpose of compliance, forensics, operations and security matters to their customers. Through an API, the customers should have the ability to receive specific information such as access, error and event logs that could improve their situation in case of an investigation.
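To illustrate what such a customer-facing forensics API might look like, here is a purely hypothetical sketch. No standardized interface of this kind exists today; the endpoint path, parameters, and response fields are invented for illustration, and the HTTP calls use the third-party requests package.

```python
import requests

BASE_URL = "https://csp.example.com/v1/forensics"   # hypothetical endpoint
API_TOKEN = "customer-api-token"                    # placeholder credential


def fetch_access_logs(tenant_id: str, since: str) -> list[dict]:
    """Pull access/error/event log records for one tenant since a given timestamp."""
    resp = requests.get(
        f"{BASE_URL}/logs",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        params={"tenant": tenant_id, "type": "access", "since": since},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["records"]                   # assumed response shape


# Example: collect evidence of who accessed a tenant's data after a suspected compromise.
# for rec in fetch_access_logs("tenant-42", since="2024-01-01T00:00:00Z"):
#     print(rec["timestamp"], rec["principal"], rec["resource"])
```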
Furthermore, due to the limited ability to receive forensic information from the server and to prove the integrity of stored data in SaaS scenarios, the client has to contribute to this process. This could be achieved by implementing Proofs of Retrievability (POR), in which a verifier (client) is enabled to determine that a prover (server) possesses a file or data object and that it can be retrieved unmodified [24]. Provable Data Possession (PDP) techniques [37] could be used to verify that an untrusted server possesses the original data without the need for the client to retrieve it. Although these cryptographic proofs have not been implemented by any CSP, the authors of [23] introduced a new data integrity verification mechanism for SaaS scenarios which could also be used for forensic purposes.

2) PaaS Environments: One of the main advantages of the PaaS model is that the developed software application is under the control of the customer and, except for some CSP, the source code of the application does not have to leave the local development environment. Given these circumstances, the customer theoretically obtains the power to dictate how the application interacts with other dependencies such as databases, storage entities etc. CSP normally claim this transfer is encrypted, but this statement can hardly be verified by the customer. Since the customer has the ability to interact with the platform over a prepared API, system states and specific application logs can be extracted. However, potential adversaries who can compromise the application during runtime should not be able to alter these log files afterwards.

Suggested Solution: Depending on the runtime environment, logging mechanisms could be implemented which automatically sign and encrypt the log information before its transfer to a central logging server under the control of the customer. Additional signing and encrypting could prevent potential eavesdroppers from being able to view and alter log data information on the way to the logging server. Runtime compromise of a PaaS application by adversaries could be monitored by push-only mechanisms for log data, presupposing that the information needed to detect such an attack is logged. Increasingly, CSP offering PaaS solutions give developers the ability to collect and store a variety of diagnostics data in a highly configurable way with the help of runtime feature sets [38].
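The "sign and encrypt before transfer" suggestion above can be sketched as follows. This is a minimal illustration, not the mechanism proposed in the paper: it assumes a pre-shared HMAC key, uses a symmetric Fernet key from the third-party cryptography package, and omits the actual network push to the customer-controlled logging server.

```python
import hmac
import hashlib
import json
import time

from cryptography.fernet import Fernet  # pip install cryptography

SIGNING_KEY = b"pre-shared-signing-key"        # assumption: provisioned out of band
ENCRYPTION_KEY = Fernet.generate_key()         # assumption: held by the customer's log server
fernet = Fernet(ENCRYPTION_KEY)


def protect_log_entry(message: str) -> bytes:
    """Sign, then encrypt, a single log entry before pushing it off the platform."""
    entry = {"ts": time.time(), "msg": message}
    payload = json.dumps(entry, sort_keys=True).encode()
    tag = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    signed = json.dumps({"payload": payload.decode(), "hmac": tag}).encode()
    return fernet.encrypt(signed)              # opaque to eavesdroppers in transit


def verify_log_entry(blob: bytes) -> dict:
    """Decrypt and check the HMAC on the customer-controlled logging server."""
    signed = json.loads(fernet.decrypt(blob))
    payload = signed["payload"].encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signed["hmac"]):
        raise ValueError("log entry was modified in transit")
    return json.loads(payload)


# Example: the application pushes protected entries; the log server verifies them.
blob = protect_log_entry("admin login from 203.0.113.7")
print(verify_log_entry(blob)["msg"])
```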
3) IaaS Environments: As expected, even virtual instances in the cloud get compromised by adversaries. Hence, the ability to determine how defenses in the virtual environment failed and to what extent the affected systems have been compromised is crucial, and not only for recovering from an incident: forensic investigations also gain leverage from such information and contribute to resilience against future attacks on the systems. From the forensic point of view, IaaS instances provide much more evidence data usable for potential forensics than PaaS and SaaS models do. This fact is caused by the ability of the customer to install and set up the image for forensic purposes before an incident occurs. Hence, as proposed for PaaS environments, log data and other forensic evidence information could be signed and encrypted before it is transferred to third-party hosts, mitigating the chance that a maliciously motivated shutdown process destroys the volatile data. Although IaaS environments provide plenty of potential evidence, it has to be emphasized that the customer VM is in the end still under the control of the CSP. The CSP controls the hypervisor, which is, e.g., responsible for enforcing hardware boundaries and routing hardware requests among different VMs. Hence, besides the security responsibilities of the hypervisor, the CSP exerts tremendous control over how customer VMs communicate with the hardware and can theoretically intervene in processes executed on the hosted virtual instance through virtual introspection [25]. This could also affect encryption or signing processes executed on the VM and therefore lead to the leakage of the secret key. Although this risk can be disregarded in most of the cases, the impact on the security of high-security environments is tremendous.

a) Snapshot Analysis: Traditional forensics expects target machines to be powered down to collect an image (dead virtual instance). This situation completely changed with the advent of the snapshot technology, which is supported by all popular hypervisors such as Xen, VMware ESX and Hyper-V. A snapshot, also referred to as the forensic image of a VM, provides a powerful tool with which a virtual instance can be cloned with one click, including the running system's memory. Due to the invention of the snapshot technology, systems hosting crucial business processes do not have to be powered down for forensic investigation purposes. The investigator simply creates and loads a snapshot of the target VM for analysis (live virtual instance). This behavior is especially important for scenarios in which a downtime of a system is not feasible or practical due to existing SLA. However, the information whether the machine is running or has been properly powered down is crucial [3] for the investigation. Live investigations of running virtual instances are becoming more common, providing evidence data that ...
Common Methods for Handling Missing Values in Empirical Economics and Management Literature
Methods for Handling Missing Values in Empirical Studies in Economics and Management

Missing values are a common issue in empirical studies in economics and management. These missing values can occur for a variety of reasons, such as data collection errors, non-response from survey participants, or incomplete information. Dealing with missing values is crucial for maintaining the quality and reliability of empirical findings. In this article, we will discuss some common methods for handling missing values in empirical studies in economics and management.

1. Complete Case Analysis
One common approach to handling missing values is to simply exclude cases with missing values from the analysis. This method is known as complete case analysis. While this method is simple and straightforward, it can lead to biased results if the missing values are not missing completely at random. In other words, if the missing values are related to the outcome of interest, excluding cases with missing values can lead to biased estimates.

2. Imputation Techniques
Imputation techniques are another common method for handling missing values. Imputation involves replacing missing values with estimated values based on the observed data. There are several methods for imputing missing values, including mean imputation, median imputation, and regression imputation. Mean imputation involves replacing missing values with the mean of the observed values for that variable. Median imputation involves replacing missing values with the median of the observed values. Regression imputation involves using a regression model to predict missing values based on other variables in the dataset.

3. Multiple Imputation
Multiple imputation is a more sophisticated imputation technique that involves generating multiple plausible values for each missing value and treating each set of imputed values as a complete dataset. This allows uncertainty in the imputed values to be properly accounted for in the analysis. Multiple imputation has been shown to produce less biased estimates compared to single imputation methods.

4. Maximum Likelihood Estimation
Maximum likelihood estimation is another method for handling missing values; it estimates the parameters of a statistical model by maximizing the likelihood function of the observed data. Missing values are treated as parameters to be estimated along with the other parameters of the model. Maximum likelihood estimation has been shown to produce unbiased estimates under certain assumptions about the missing data mechanism.

5. Sensitivity Analysis
Sensitivity analysis is a useful technique for assessing the robustness of empirical findings to different methods of handling missing values. This involves conducting the analysis using different methods for handling missing values and comparing the results. If the results are consistent across different methods, this provides more confidence in the validity of the findings.

In conclusion, there are several methods available for handling missing values in empirical studies in economics and management. Each method has its advantages and limitations, and the choice of method should be guided by the nature of the data and the research question. It is important to carefully consider the implications of missing values and choose the most appropriate method for handling them to ensure the validity and reliability of empirical findings.
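As a brief illustration of the imputation techniques described above, the sketch below applies mean imputation and a simple multiple-imputation-style procedure using pandas and scikit-learn. The variables and values are invented for the example; a real study would use its own data and a full multiple-imputation workflow, including pooling of estimates across the imputed datasets.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import SimpleImputer, IterativeImputer

# Toy firm-level dataset with missing observations (values are illustrative).
df = pd.DataFrame({
    "revenue":   [120.0, np.nan, 98.0, 143.0, np.nan],
    "employees": [50,    42,     np.nan, 61,   39],
    "rd_spend":  [10.0,  8.5,    7.9,  np.nan, 6.4],
})

# 1) Mean imputation: replace each missing value with the column mean.
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)

# 2) A regression-based imputer run several times with different seeds,
#    mimicking multiple imputation (each run yields one "plausible" dataset).
imputed_datasets = [
    pd.DataFrame(
        IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(df),
        columns=df.columns,
    )
    for seed in range(5)
]

print(mean_imputed.round(2))
print(imputed_datasets[0].round(2))
```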
Provenance (the Principle of Provenance)
The Archival Principle of Provenance and Its Application to Image Representation Systems

Variously described as a "powerful guiding principle" (Dearstyne, 1993) and "the only principle" of archival theory (Horsman, 1994), the Principle of Provenance distinguishes the archival profession from other information professions in its focus on a document's context, use and meaning. This Principle, generally concerned with the origin of records, has three distinct meanings (Bellardo & Bellardo, 1992). First, and generally, it refers to the "office of origin" of records, or that office, administrative entity, person, family, or firm from which records, personal papers or manuscripts originate. Second, it refers to collecting information on successive transfers of ownership or custody of a particular paper or manuscript; and third, it refers to the idea that an archival collection of a given records creator must not be intermingled with those of other records creators. In this sense, the principle is often referred to by the French expression respect des fonds. A corollary principle, solemnly entitled the "Principle of the Sanctity of Original Order," states that records should be kept in the order in which they were originally arranged.

The Principle of Provenance was independently developed by early modern French and Prussian archives managers in the nineteenth century, and had its origins in necessity, both theoretical and practical. Prior to the development of the Principle, archives were arranged and described according to the "principle of pertinence," where archives were arranged in terms of their subject content regardless of provenance and original order (Gränström, 1994). With the development of state-run archives in France and Prussia, the sheer volume of incoming records made working by this ethic impractical. Furthermore, historians of this era were, as they still are, concerned with the objectivity of their original source material. They wanted to be able to establish what really took place, and to do that, they felt that the written sources should be maintained in their original order, and not rearranged. So the Principle met both standards: it was much easier and faster to process collections if there was no need to assign subject headings to each document or fonds, and it met the objectivity standards put forth by historians. Related to the historical standards, the Principle of Provenance also held with medieval diplomatic procedures, which were concerned with defining and evaluating records based on their authenticity and evidential, primarily legal, value.

However powerful, objective, and practical the Principle of Provenance might be, there is still a major complexity that bears some examination, namely the organic nature of archives. Peter Horsman has written two articles related to this problem. His essential argument is that an archival source (be it an administration, a person or a family) is a living organism, its fonds grow and change with it, and there is rarely a time where one absolute, unchanging physical order for its documents exists. Rather, the fonds "are a complicated result of the activities of the creator, political decisions, organizational behavior, record-keeping methods and many other unexpected events" (Horsman, 1994).
The traditional inventory or finding aid is simply a snapshot of the records at one distinct moment in time, typically at the end of their useful life, and acts only as evidence that a certain set of inter-related documents were physically gathered together at some defined instant (Horsman, 1999). The real power of an archive, as yet underutilized, is the notion of providing context. Context is a more complicated concept than "original order," however, and in this case is concerned primarily with describing a continuum of relationships and inter-relationships over time and place. Preserving the physical original order of a fonds, which Horsman defines as the internal application of the Principle of Provenance, is merely a logistical artifact; valuable because it is, at least, "an original administrative artifact," not defined from outside. To comprehend context, Horsman argues that the archivist not only has to describe and define the structure of the fonds in its series and sub-series, but also to define and describe the relationships between the agency's characteristics or functions and the records it has created throughout the range of its existence.

Unlikely though it may be, this idea of providing meaningful contextual information is also a problem being considered by art historians, in a quest to describe works of art from different cultures in significant and equivalent language. The most recent work is being done by David Summers in his new tome, Real Spaces: World Art History and the Rise of Western Modernism (Summers, 2003). Although the two fields, archival science and the history of art, might at first glance seem to have little in common, on the first page of the introduction Summers states, "However the discipline of the history of art may have changed over the last few decades of theoretical and critical examination, it has continued to be an archival field, concerned with setting its objects in spatial and temporal order, and with relating them to appropriate documents and archaeological evidence." In trying to develop a new descriptive language for works of art, Summers focuses on the "organic nature" of the work, concentrating on the overarching theoretical construct of "facture," which embodies the idea that the object itself carries some record of its having been made. The value of this physical and format-based characteristic is primary and unassailable.[1] There is an obvious parallel here with the "organic character of records" discussed by Schellenberg (1961):

"Records that are the product of organic activity have a value that derives from the way they were produced. Since they were created in consequence of the actions to which they relate, they often contain an unconscious and therefore impartial record of the action. Thus the evidence they contain of the actions they record has a peculiar value. It is the quality of this evidence that is our concern here. Records, however, also have a value for the evidence they contain of the actions that resulted in their production.
It is the content of the evidence that is our concern here."

What Summers calls "facture" and Schellenberg calls "evidential value" are related, and I think not explicitly spelled out due to the varying nature of their tasks: Summers is presenting a highly theoretical descriptive language for works of art, and Schellenberg, while concerned with theoretical underpinnings, is primarily interested in providing a real framework within which real, physical organizations (namely archives) can arrange and describe their collections.

How does this relate to image content management systems? While Summers' framework, such as it is,[2] could be expanded to include descriptive languages for "anything that is made," it was developed first and foremost for cultural, artistic artifacts. He argues that access to and understanding of artifacts will improve if we can provide more complete information on a given artifact's facture (Winget, 2003) and provenance. Significantly, Summers is using the term "provenance" in an archival sense: he is concerned with documenting the name of the creator as well as the organization or entity for which the artifact was created; that creator or entity's functions, relationships, and predecessors; and the artifact's successive spaces and uses throughout the range of its life. The fact that a Renaissance triptych, for example, started out as a functional devotional device, lost that functionality, and was collected by a host of individuals for its monetary or artifactual value (let's say the last individual to collect the triptych was a German Jew, whose collection was perhaps stolen by the Nazis, and now it resides in an American museum collection) is all noteworthy and interesting information, and, Summers argues rather forcefully, significantly more valuable than simply providing subject access to that image.

[1] I think it's relevant here to point out that for Summers, a "work of art" is not limited to traditionally considered art objects. His definition is wider and more inclusive, and consists of "anything that is made."
[2] Real Spaces is a nine-hundred-page book. It doesn't put forth a "framework" so much as a dense theoretical construct.

Right now, image database managers, after worrying about quality and sustainability issues, seem to be primarily concerned with providing thematic or subject-oriented access to their collections. They are working with the "principle of pertinence," as it were, and they are running into the same problems that early-modern archivists had. It takes a very long time to provide robust subject access; it is not objective; and in the worst cases it can hinder retrieval. If they could twist the Principle of Provenance to relate primarily to providing access through description, rather than focusing on its use in arrangement,[3] meaningful use of these image collections might rise, and retrieval problems might decline. The people in charge of image content management systems have a unique opportunity to develop a new system based principally on the user, providing facture and provenantial information without the difficulty of keeping the strict hierarchical structure that archives face.
What's more, for artifacts collected by museums at least, most of this information is already available: when acquiring a new work, curators research the artifact's provenance to ensure that it is authentic and not stolen; conservators keep deliberate records about the format, materials and processes inherent in an artifact, and they furthermore tend to document any changes that happen to the work over time. There are a multitude of administrative attributes that are noted within the course of owning and maintaining culturally significant artifacts. The only problem is that these records aren't typically considered "important," and they are usually in paper form. If they are available digitally, access points are typically not provided (you can't search on these terms).

Summers' new framework now gives us the theoretical tools to recognize these attributes' importance, and the archival profession gives us a practical framework within which to work. Metadata initiatives like the Dublin Core and METS provide specific requirements for collecting information and describing these objects; the CIDOC-CRM provides an ontology that could be used to add semantic meaning (and hence understanding) between disparate attributes within these schemas; and OAIS provides frameworks within which information can be shared across space and disciplines. The pieces are all there. Provenance has proved to be a powerful and uniquely user-centered concept for the archival profession. With the advent of ubiquitous digital technology, which tends to help transfer ideas across traditional professional boundaries, it's time to expand and translate that notion to other fields for other uses.

[3] I say that arrangement is not so important for image database structure because image databases generally don't rely on hierarchies to the same extent that traditional archives do.

References

Bellardo, L. J., & Bellardo, L. L. (1992). A glossary for archivists, manuscript curators, and records managers. Chicago: Society of American Archivists.
Dearstyne, B. W. (1993). The archival enterprise: Modern archival principles, practices, and management techniques. Chicago, IL: American Library Association.
Gränström, C. (1994). The Janus syndrome. In The Principle of Provenance. Stockholm: Swedish National Archives.
Horsman, P. (1994). Taming the elephant: An orthodox approach to the Principle of Provenance. In The Principle of Provenance. Stockholm: Swedish National Archives.
Horsman, P. (1999). Dirty hands: A new perspective on the original order. Archives and Manuscripts, 27(1), 42-53.
Schellenberg, T. R. (1961). Archival principles of arrangement. American Archivist, 24, 11-24.
Summers, D. (2003). Real spaces: World art history and the rise of modernism. New York: Phaidon.
Winget, M. (2003). Metadata for digital images: Theory and practice.
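As an addendum to the metadata discussion above, here is a small, hypothetical sketch of an image record that carries provenance and facture information alongside subject access. The field values are invented; the term names follow DCMI Metadata Terms (dc:* / dcterms:*, including dcterms:provenance), and a production system would more likely serialize such a record into an XML or RDF schema such as METS or CIDOC-CRM rather than plain JSON.

```python
import json

# Illustrative image record: provenance and facture recorded alongside subject terms.
# Term names follow DCMI Metadata Terms; the values are invented for this example.
record = {
    "dc:identifier": "img-000421",
    "dc:title": "Triptych with donor portraits",
    "dc:creator": "Unknown Netherlandish workshop",
    "dc:subject": ["devotional object", "triptych"],
    "dcterms:created": "ca. 1480",
    "dcterms:medium": "oil on oak panel",            # facture: materials and process
    "dcterms:provenance": [
        "Commissioned as a devotional object for a private chapel",
        "Private collection, Berlin, until 1938",
        "Restituted and acquired by an American museum, 1998",
    ],
    "dcterms:isPartOf": "European Paintings Collection",
}

print(json.dumps(record, indent=2, ensure_ascii=False))
```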
An English Essay on Accepting Advice
English answer:

In a world where information is readily accessible, it is becoming increasingly important to have the ability to critically evaluate and select reliable sources. Accepting advice can be a challenging task, especially when it comes from an unknown or unreliable source. However, by following a few simple guidelines, you can significantly improve your chances of making informed decisions based on trustworthy advice.

1. Consider the Source: The first step in evaluating advice is to consider the source. What is their expertise in the area? Do they have a vested interest in the outcome? Are they known for providing accurate and unbiased information? If you can't find a clear answer to these questions, you should approach the advice with caution.

2. Look for Multiple Perspectives: It is never a good idea to rely on a single source for advice. Seek out multiple perspectives to gain a broader understanding of the issue. This will help you identify any potential biases or limitations in the information you are receiving.

3. Evaluate the Evidence: Once you have gathered advice from multiple sources, it is important to evaluate the evidence presented to support each claim. Are the sources credible? Is the data presented reliable? Are there any obvious flaws in the logic or reasoning? By critically examining the evidence, you can assess the validity of the advice.

4. Consider Your Own Values and Beliefs: While it is important to consider the advice of others, you should also take into account your own values and beliefs. Does the advice align with your own principles and goals? If not, it may not be the best course of action for you.

5. Seek Professional Guidance if Needed: In some cases, it may be necessary to seek professional guidance before making a decision. If the advice you are receiving is related to a sensitive or complex issue, it is recommended to consult with an expert in the field.

Chinese answer: Tips for accepting advice.
Grade 7 English Data and Statistics Multiple-Choice Questions (80 Questions)
1. There are ______ students in our class.
A. twenty  B. twentys  C. twentyes  D. twentith
Answer: A.
Explanation: "twenty" is the correct spelling of 20; option B "twentys" and option C "twentyes" are misspelled; option D "twentith" is the ordinal "twentieth," which does not fit the sentence.

2. I have ______ apples.
A. three  B. third  C. the three  D. the third
Answer: A.
Explanation: "three" is the cardinal number expressing quantity, so A is correct; option B "third" is an ordinal number; option C "the three" is incorrect here; option D "the third" means "the third one."

3. The price of the shoes is ______.
A. fifty yuan  B. fiftieth yuan  C. the fifty yuan  D. the fiftieth yuan
Answer: A.
Explanation: "fifty yuan" expresses the amount of fifty yuan, so A is correct; options B and C are incorrect expressions; option D "the fiftieth yuan" would mean "the fiftieth yuan," which does not fit the sentence.

4. We need ______ books.
A. five  B. fifth  C. the five  D. the fifth
Answer: A.
Explanation: "five" is the cardinal number expressing quantity, so A is correct; option B "fifth" is an ordinal number; option C "the five" is incorrect here; option D "the fifth" means "the fifth one."

5. There are ______ days in a week.
A. seven  B. seventh  C. the seven  D. the seventh
Answer: A.
Explanation: "seven" is the cardinal number expressing quantity, so A is correct; option B "seventh" is an ordinal number; option C "the seven" is incorrect here; option D "the seventh" means "the seventh."
2025 Postgraduate Entrance Examination: English (I, 201) Test Paper with Answers and Guidance
2025 Postgraduate Entrance Examination English (I, 201) self-test paper with answers and guidance.

Part One: Cloze (10 points)

Section I: Cloze Test
Directions: Read the following text carefully and choose the best answer from the four choices marked A, B, C, and D for each blank.

Passage:
In today's rapidly evolving digital landscape, the role of social media has become increasingly significant. Social media platforms are not just tools for personal interaction; they also serve as powerful channels for business promotion and customer engagement. Companies are now leveraging these platforms to reach out to their target audience more effectively than ever before. However, the effectiveness of social media marketing (1)___ on how well the company understands its audience and the specific platform being used. For instance, while Facebook may be suitable for reaching older demographics, Instagram is more popular among younger users. Therefore, it is crucial for businesses to tailor their content to fit the preferences and behaviors of the (2)___ demographic they wish to target.

Moreover, the rise of mobile devices has further transformed the way people consume content online. The majority of social media users now access these platforms via smartphones, which means that companies must ensure that their content is optimized for mobile viewing. In addition, the speed at which information spreads on social media can be both a boon and a bane. On one hand, positive news about a brand can quickly go viral, leading to increased visibility and potentially higher sales. On the other hand, negative publicity can spread just as fast, potentially causing serious damage to a brand's reputation. As such, it is imperative for companies to have a well-thought-out strategy for managing their online presence and responding to feedback in a timely and professional manner.

In conclusion, social media offers unparalleled opportunities for businesses to connect with customers, but it requires careful planning and execution to (3)___ the maximum benefits. By staying attuned to trends and continuously adapting their strategies, companies can harness the power of social media to foster growth and build strong relationships with their audiences.

1. [A] relies  [B] bases  [C] stands  [D] depends
2. [A] particular  [B] peculiar  [C] special  [D] unique
3. [A] obtain  [B] gain  [C] achieve  [D] accomplish

Answers:
1. D - depends
2. A - particular
3. C - achieve

This cloze test is designed to assess comprehension and vocabulary skills, as well as the ability to infer the correct usage of words within the context of the passage. Each question is crafted to require understanding of the sentence structure and meaning to select the best option.

Part Two: Traditional Reading Comprehension (4 passages, 10 points each, 40 points in total)

Passage One

Passage:
In the 1950s, the United States experienced a significant shift in the way people viewed education. This shift was largely due to the Cold War, which created a demand for a highly educated workforce. As a result, the number of students pursuing higher education in the U.S. began to grow rapidly.

One of the most important developments during this period was the creation of the Master's degree program. The Master's degree was designed to provide students with advanced knowledge and skills in a specific field. This program became increasingly popular as more and more people realized the value of a higher education.

The growth of the Master's degree program had a profound impact on American society. It helped to create a more educated and skilled workforce, which in turn contributed to the nation's economic growth.
It also helped to improve the quality of life for many Americans by providing them with opportunities for career advancement and personal development.Today, the Master’s degree is still an important part of the American educational system. However, there are some challenges that need to be addressed. One of the biggest challenges is the rising cost of education. As the cost of tuition continues to rise, many students are unable to afford the cost of a Master’s degree. This is a problem that needs to be addressed if we are to continue to provide high-quality education to all Americans.1、What was the main reason for the shift in the way people viewed education in the 1950s?A. The demand for a highly educated workforce due to the Cold War.B. The desire to improve the quality of life for all Americans.C. The increasing cost of education.D. The creation of the Master’s degree program.2、What is the purpose of the Master’s degree program?A. To provide students with basic knowledge and skills in a specific field.B. To provide students with advanced knowledge and skills in a specific field.C. To provide students with job training.D. To provide students with a general education.3、How did the growth of the Master’s degree program impact American society?A. It helped to create a more educated and skilled workforce.B. It helped to improve the quality of life for many Americans.C. It caused the economy to decline.D. It increased the cost of education.4、What is one of the biggest challenges facing the Master’s deg ree program today?A. The demand for a highly educated workforce.B. The rising cost of education.C. The desire to improve the quality of life for all Americans.D. The creation of new educational programs.5、What is the author’s main point in the last pa ragraph?A. The Master’s degree program is still an important part of the American educational system.B. The cost of education needs to be addressed.C. The Master’s degree program is no longer relevant.D. The author is unsure about the future of the Master’s degree program.第二题Reading Comprehension (Traditional)Passage:The digital revolution has transformed the way we live, work, and communicate. With the advent of the internet and the proliferation of smart devices, information is more accessible than ever before. This transformation has had a profound impact on education, with online learning platforms providing unprecedented access to knowledge. However, this shift towards digital learningalso poses challenges, particularly in terms of ensuring equitable access and maintaining educational quality.While the benefits of digital learning are numerous, including flexibility, cost-effectiveness, and the ability to reach a wider audience, there are concerns about the potential for increased social isolation and the difficulty in replicating the dynamic, interactive environment of a traditional classroom. Moreover, not all students have equal access to the technology required for online learning, which can exacerbate existing inequalities. It’s crucial that as we embrace the opportunities presented by digital technologies, we also address these challenges to ensure that no student is left behind.Educators must adapt their teaching methods to take advantage of new tools while also being mindful of the need to foster a sense of community and support among students. 
By integrating both digital and traditional approaches, it’s possible to create a learning environment that leverages the strengths of each, ultimately enhancing the educational experience for all students.Questions:1、What is one of the main impacts of the digital revolution mentioned in the passage?•A) The reduction of social interactions•B) The increase in physical book sales•C) The transformation of communication methods•D) The decline of online learning platformsAnswer: C) The transformation of communication methods2、According to the passage, what is a challenge associated with digital learning?•A) The inability to provide any form of interaction•B) The potential to widen the gap between different socioeconomic groups •C) The lack of available content for online courses•D) The complete replacement of traditional classroomsAnswer: B) The potential to widen the gap between different socioeconomic groups3、Which of the following is NOT listed as a benefit of digital learning in the passage?•A) Cost-effectiveness•B) Flexibility•C) Increased social isolation•D) Wider reachAnswer: C) Increased social isolation4、The passage suggests that educators should do which of the following in response to the digital revolution?•A) Abandon all traditional teaching methods•B) Focus solely on improving students’ technical skills•C) Integrate digital and traditional teaching methods•D) Avoid using any digital tools in the classroomAnswer: C) Integrate digital and traditional teaching methods5、What is the author’s stance on the role of digital technologies ineducation?•A) They are unnecessary and should be avoided•B) They offer opportunities that should be embraced, but with caution •C) They are the only solution to current educational challenges•D) They have no real impact on the quality of educationAnswer: B) They offer opportunities that should be embraced, but with cautionThis reading comprehension exercise is designed to test your understanding of the text and your ability to identify key points and arguments within the passage.第三题Reading PassageWhen the French sociologist and philosopher Henri Lefebvre died in 1991, he left behind a body of work that has had a profound influence on the fields of sociology, philosophy, and cultural studies. Lefebvre’s theories focused on the relationship between space and society, particularly how space is produced, represented, and experienced. His work has been widely discussed and debated, with scholars and critics alike finding value in his insights.Lefebvre’s most famous work, “The Production of Space,” published in 1974, laid the foundation for his theoretical framework. In this book, he argues that space is not simply a container for human activities but rather an active agent in shaping social relationships and structures. Lefebvre introduces the concept of “three spaces” to describe the production of space: the perceived space,the lived space, and the representative space.1、According to Lefebvre, what is the primary focus of his theories?A. The development of urban planningB. The relationship between space and societyC. The history of architectural designD. The evolution of cultural practices2、What is the main argument presented in “The Production of Space”?A. Space is a passive entity that reflects social structures.B. Space is a fundamental building block of society.C. Space is an object that can be easily manipulated by humans.D. Space is irrelevant to the functioning of society.3、Lefebvre identifies three distinct spaces. 
Which of the following is NOT one of these spaces?A. Perceived spaceB. Lived spaceC. Representative spaceD. Economic space4、How does Lefebvre define the concept of “three spaces”?A. They are different types of architectural designs.B. They represent different stages of the production of space.C. They are different ways of perceiving and experiencing space.D. They are different social classes that occupy space.5、What is the significance of Lefebvre’s work in the fields of sociology and philosophy?A. It provides a new perspective on the role of space in social relationships.B. It offers a comprehensive guide to urban planning and development.C. It promotes the idea that space is an unimportant aspect of society.D. It focuses solely on the history of architectural movements.Answers:1、B2、B3、D4、C5、A第四题Reading Comprehension (Traditional)Read the following passage and answer the questions that follow. Choose the best answer from the options provided.Passage:In recent years, there has been a growing interest in the concept of “smart cities,” which are urban areas that u se different types of electronic data collection sensors to supply information which is used to manage assets and resources efficiently. This includes data collected from citizens, devices, andassets that is processed and analyzed to monitor and manage traffic and transportation systems, power plants, water supply networks, waste management, law enforcement, information systems, schools, libraries, hospitals, and other community services. The goal of building a smart city is to improve quality of life by using technology to enhance the performance and interactivity of urban services, to reduce costs and resource consumption, and to increase contact between citizens and government. Smart city applications are developed to address urban challenges such as environmental sustainability, mobility, and economic development.Critics argue, however, that while the idea of a smart city is appealing, it raises significant concerns about privacy and security. As more and more aspects of daily life become digitized, the amount of personal data being collected also increases, leading to potential misuse or unauthorized access. Moreover, the reliance on technology for critical infrastructure can create vulnerabilities if not properly secured against cyber-attacks. There is also a risk of widening the digital divide, as those without access to the necessary technologies may be left behind, further exacerbating social inequalities.Despite these concerns, many governments around the world are moving forward with plans to develop smart cities, seeing them as a key component of their future strategies. 
They believe that the benefits of improved efficiency and service delivery will outweigh the potential risks, provided that adequate safeguards are put in place to protect citizen s’ data and ensure the resilience of thecity’s technological framework.Questions:1、What is the primary purpose of developing a smart city?•A) To collect as much data as possible•B) To improve the quality of life through efficient use of technology •C) To replace all traditional forms of communication•D) To eliminate the need for human interaction in urban services2、According to the passage, what is one of the main concerns raised by critics regarding smart cities?•A) The lack of available technology•B) The high cost of implementing smart city solutions•C) Privacy and security issues related to data collection•D) The inability to provide essential services3、Which of the following is NOT mentioned as an area where smart city technology could be applied?•A) Traffic and transportation systems•B) Waste management•C) Educational institutions•D) Agricultural production4、How do some governments view the development of smart cities despite the criticisms?•A) As a risky endeavor that should be avoided•B) As a temporary trend that will soon pass•C) As a strategic move with long-term benefits•D) As an unnecessary investment in technology5、What does the term “digital divide” refer to in the context of smart cities?•A) The gap between the amount of data collected and the amount of data analyzed•B) The difference in technological advancement between urban and rural areas•C) The disparity in access to technology and its impact on social inequality•D) The separation of digital and non-digital methods of service delivery Answers:1、B) To improve the quality of life through efficient use of technology2、C) Privacy and security issues related to data collection3、D) Agricultural production4、C) As a strategic move with long-term benefits5、C) The disparity in access to technology and its impact on social inequality三、阅读理解新题型(10分)Reading Comprehension (New Type)Passage:The rise of e-commerce has transformed the way people shop and has had aprofound impact on traditional brick-and-mortar retailers. Online shopping offers convenience, a wide range of products, and competitive prices. However, it has also raised concerns about the future of physical stores. This passage examines the challenges and opportunities facing traditional retailers in the age of e-commerce.In recent years, the popularity of e-commerce has soared, thanks to advancements in technology and changing consumer behavior. According to a report by Statista, global e-commerce sales reached nearly$4.2 trillion in 2020. This upward trend is expected to continue, with projections showing that online sales will account for 25% of total retail sales by 2025. As a result, traditional retailers are facing fierce competition and must adapt to the digital landscape.One of the main challenges for brick-and-mortar retailers is the shift in consumer preferences. Many shoppers now prefer the convenience of online shopping, which allows them to compare prices, read reviews, and purchase products from the comfort of their homes. This has led to a decrease in foot traffic in physical stores, causing many retailers to struggle to attract customers. 
Additionally, the ability to offer a wide range of products at competitive prices has become a hallmark of e-commerce, making it difficult for traditional retailers to compete.Despite these challenges, there are opportunities for traditional retailers to thrive in the age of e-commerce. One approach is to leverage the unique strengths of physical stores, such as the ability to provide an immersiveshopping experience and personalized customer service. Retailers can also use technology to enhance the in-store experience, such as implementing augmented reality (AR) to allow customers to visualize products in their own homes before purchasing.Another strategy is to embrace the digital world and create a seamless shopping experience that integrates online and offline channels. For example, retailers can offer online returns to brick-and-mortar stores, allowing customers to shop online and return items in person. This not only provides convenience but also encourages customers to make additional purchases while they are in the store.Furthermore, traditional retailers can leverage their established brand loyalty and customer base to create a competitive advantage. By focusing on niche markets and offering unique products or services, retailers can differentiate themselves from e-commerce giants. Additionally, retailers can invest in marketing and promotions to drive traffic to their physical stores, even as more consumers turn to online shopping.In conclusion, the rise of e-commerce has presented traditional retailers with significant challenges. However, by embracing the digital landscape, leveraging their unique strengths, and focusing on customer satisfaction, traditional retailers can adapt and thrive in the age of e-commerce.Questions:1.What is the main concern raised about traditional retailers in the age of e-commerce?2.According to the passage, what is one of the main reasons for the decline in foot traffic in physical stores?3.How can traditional retailers leverage technology to enhance the in-store experience?4.What strategy is mentioned in the passage that involves integrating online and offline channels?5.How can traditional retailers create a competitive advantage in the age of e-commerce?Answers:1.The main concern is the fierce competition from e-commerce and the shift in consumer preferences towards online shopping.2.The main reason is the convenience and competitive prices offered by e-commerce, which make it difficult for traditional retailers to compete.3.Traditional retailers can leverage technology by implementing augmented reality (AR) and offering online returns to brick-and-mortar stores.4.The strategy mentioned is to create a seamless shopping experience that integrates online and offline channels, such as offering online returns to brick-and-mortar stores.5.Traditional retailers can create a competitive advantage by focusing on niche markets, offering unique products or services, and investing in marketing and promotions to drive traffic to their physical stores.四、翻译(本大题有5小题,每小题2分,共10分)First QuestionTranslate the following sentence into Chinese. Write your translation on the ANSWER SHEET.Original Sentence:“Although technology has brought about nume rous conveniences in our daily lives, it is also true that it has led to significant privacy concerns, especially with the rapid development of digital communication tools.”Answer:尽管技术在我们的日常生活中带来了诸多便利,但也不可否认它导致了重大的隐私问题,尤其是在数字通信工具快速发展的情况下。
Database Provenance
More than Accuracy*
*MIT IQ Program, Richard Wang, Director
Areas of Specialization
• Data Integrity
– The dependability and trustworthiness of information
Jerry Talburt
Why is ER Important?
• In the commercial world, ER is foundational to Customer Relationship Management (CRM)
• In government, ER is a basic tool in law enforcement and intelligence analysis ("connecting-the-dots")
• Applications
– Customer Integration & Recognition
– Fraud Management
Law Enforcement ER
• Entities
– Persons of Interest
– Networks and coalitions
– Suspicious events
• Confidence in key measurements
– Accuracy, objectivity, reliability
– Objectivity, relevance
– Completeness, timeliness
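As a concrete illustration of the kind of record matching ER performs, the sketch below compares person records by normalized name and address similarity. It is a minimal, hypothetical example (the field names, sample records, and the 0.85 threshold are illustrative, not taken from any particular ER system) using only Python's standard library.

import difflib

def normalize(text):
    # Lowercase, strip punctuation, and sort tokens so that
    # "Smith, John" and "John Smith" normalize to the same string.
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(sorted(tokens))

def similarity(a, b):
    # Similarity in [0, 1] from the standard-library sequence matcher.
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def same_entity(rec_a, rec_b, threshold=0.85):
    # Treat two records as the same entity when both name and address
    # are sufficiently similar; real systems use richer rules and scores.
    return (similarity(rec_a["name"], rec_b["name"]) >= threshold and
            similarity(rec_a["address"], rec_b["address"]) >= threshold)

records = [
    {"id": 1, "name": "John Q. Smith", "address": "12 Main Street, Springfield"},
    {"id": 2, "name": "Smith, John",   "address": "12 Main St, Springfield"},
    {"id": 3, "name": "Jane Doe",      "address": "98 Elm Avenue, Shelbyville"},
]

# Compare every pair once; records 1 and 2 should come out as a likely match.
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        if same_entity(records[i], records[j]):
            print("possible match:", records[i]["id"], records[j]["id"])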
Database Provenance
• University of Pennsylvania Database Research Group
• Stanford University InfoLab
• University of Maryland Database Research Group
• University of Washington Database Research Group
• Berkeley Database Research Group
Data Mining_Jiawei Han_03 Preprocessing
Fill it in automatically with:
– a global constant: e.g., "unknown", a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same class: smarter
– the most probable value: inference-based, such as Bayesian formula or decision tree
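A minimal pandas sketch of the first three options above (global constant, overall attribute mean, and class-conditional mean); the column names, the -1 sentinel, and the toy values are illustrative. The inference-based option (most probable value) would normally use a learned model and is not shown.

import pandas as pd

# Toy data: "income" has missing values; "class" drives the class-conditional fill.
df = pd.DataFrame({
    "class":  ["A", "A", "A", "B", "B", "B"],
    "income": [30.0, None, 50.0, 70.0, 90.0, None],
})

# Option 1: a global constant (a sentinel such as -1, or "unknown" for categoricals).
df["income_const"] = df["income"].fillna(-1)

# Option 2: the attribute mean over all samples.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Option 3: the attribute mean per class (usually smarter).
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(df)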
Chapter 3: Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
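The sketch below illustrates, under assumed column names and toy values, how a few of these problems (inconsistent naming conventions, duplicate records, incomplete data) are commonly handled with pandas; it is an illustration of the ideas on this slide, not a complete cleaning pipeline.

import pandas as pd

df = pd.DataFrame({
    "name":    ["Acme Corp", "ACME Corp.", "Beta LLC", "Beta LLC", None],
    "country": ["US", "us", "DE", "DE", "FR"],
    "revenue": [100.0, 100.0, 55.0, 55.0, None],
})

# Inconsistency in naming conventions: normalize case and punctuation first,
# otherwise "Acme Corp" and "ACME Corp." look like different companies.
df["name"] = df["name"].str.lower().str.replace(".", "", regex=False).str.strip()
df["country"] = df["country"].str.upper()

# Duplicate records: drop exact duplicates after normalization.
df = df.drop_duplicates()

# Incomplete data: separate rows missing critical fields for review
# instead of silently keeping or discarding them.
incomplete = df[df[["name", "revenue"]].isna().any(axis=1)]
clean = df.dropna(subset=["name", "revenue"])

print(clean)
print(incomplete)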
Data provenance – the foundation of data quality
Data provenance–the foundation of data qualityPeter BunemanUniversity of Edinburgh Edinburgh,UKopb@Susan B.DavidsonUniversity of Pennsylvania Philadelphia,USA susan@September1,20101What is Provenance?Provenance for data1is often defined by analogy with its use for the history of non-digital artifacts,typically works of art.While this provides a starting point for our understanding of the term,it is not adequate for at least two important reasons.First,an art historian seeing a copy of some artifacts will regard its provenance as very different from that of the original.By contrast,when we make use of a digital artifact,we often use a copy,and and copying in no sense destroys the provenance of that artifact.Second,digital artefacts are seldom“raw”data.They are created by a process of transformation or computation on other data sets;and we regard this process as a part of the provenance.For“non-digital”artefacts,provenance is usually traced back to the point of creation,but no further.These two issues–the copying and transformation of data–willfigure prominently in our discussion of provenance,but rather than attempting a more elaborate definition,we shall start with some examples.•In September2008,a news article concerning the2002bankruptcy of UAL,United Airlines’parent company,appeared on the list of top news stories from Google News.The(undated)article provoked investor panic and UAL’s share price dropped by75%before it was realized that the article was six years out of date.•Look up some piece of demographic information on the Web and see if you can get an accurate account of its provenance.A recent Web search for“population Corfu”yielded14different values in thefirst 20hits ranging from70,000to150,000,several of them given to the nearest person.While some of the values were dated,in only one case was any kind of attribution or citation given.•Most scientific data sets are not raw observations but have been generated by a complex analysis of the raw data with a mixture of scientific software and human input.Keeping a record of this process–the workflow–is all-important to a scientist who wants to use the data and be satisfied of its validity.Also,in science and scholarship generally,there has been a huge increase in the use of databases to replace the traditional reference works,encyclopedias,gazetteers etc.These curated databases are constructed with a lot of manual effort,usually by copying from other sources.Few of these keep a satisfactory record of the source.1The terms“pedigree”and“lineage”have also been used synonymously with provenance.1•The Department of Defense will be painfully aware of the1999bombing of the Chinese Embassy in Belgrade which was variously attributed to an out-of-date map or a misunderstanding of the procedure intended to locate the intended target–in either case a failure to have or to use provenance information.All these examples clearly indicate that,for data provenance,we need to understand the issues of data creation,transformation,and copying.In the following sections we shall briefly summarize the progress to date in understanding them and then outline some of the practical and major challenges that arise from this understanding.The topics treated here are,not surprisingly,related to the authors’interests.There are a number of surveys and tutorials on provenance that will give a broader coverage[BT07,SPG05,DF08, MFF+08,CCT09,BCTV08,FKSS08]2Models of ProvenanceAs interest in data provenance has developed over the past few years,researchers and software 
developers have realized that it is not an isolated issue.Apart from being a fundamental issue in data quality,connections have developed with other areas of computer science:•Query and update languages.•Probabilistic databases•Data integration•Data cleaning•Debugging schema transformations•File/data synchronization•Program debugging(program slicing)•Security and privacyIn some cases,such as data cleaning,data integration andfile synchronization,the connection is obvious: Provenance information may help to determine what to select among conflicting pieces of information.In other cases the connection is less obvious but nevertheless important.For example in program slicing the idea is to trace theflow of data through the execution of a program for debugging purposes.What has emerged as a result of these studies is the need to develop the right models of provenance.Two general models have been developed for somewhat different purposes:workflow and data provenance–also called coarse-andfine-grain provenance.Both of these concern the provenance of data but in different ways. Incidentally,we have used“data provenance”as a general term and forfine-grain provenance.As we shall see there is no hard boundary between coarse–andfine–grain provenance and the context should indicate the sense in which we are using the term.2.1Workflow provenanceAs indicated in our examples,scientific data sets are seldom“raw”.What is published is usually the result of a sophisticated workflow that takes the raw observations(e.g.from a sensor network or a microarray image)and subjects them to a complex set of data transformations that may involve informed input from a scientist.Keeping an accurate record of what was done is crucial if one wants to verify or,more importantly to repeat an experiment.2For example,the workflow in Figure1estimates disease susceptibility based on genome-wide SNP array data. The input to the workflow(indicated by the node or module labeled I)is a set of SNPs,ethnicity information, lifestyle,family history,and physical symptoms.Thefirst module within the root of the workflow(the dotted box labeled W1),M1,determines a set of disorders the patient is genetically susceptible to based on the input SNPs and ethnicity information.The second module,M2,refines the set of disorders the patient is at risk for based on their lifestyle,family history,and physical symptoms.M1and M2are complex processes, as indicated by theτexpansions to the subworkflows labeled W2and W3,respectively.To understand the prognosis for an individual(the output indicated by the module labeled O),it is important to be able to trace back through this complex workflow and see not only what processing steps were executed,but examine the intermediate data that is generated,in particular the results returned by OMIM and PubMed queries in W4 (modules M8and M9,respectively).2.2Data ProvenanceThis is concerned with the provenance of relatively small pieces of data.Suppose,for example,on a visit to your doctor,youfind that your office telephone number has been incorrectly recorded.It is probable that it was either entered incorrectly or somehow incorrectly copied from some other data set.In either case one would like to understand what happened in case the incorrect number occurs in other medical records. 
Here we do not want a complete workflow for all medical records–such a thing is probably impossible to construct.What we are looking for is some simple explanation of how that telephone number arrived at where you saw it,e.g.when and where it was entered and who copied it.More generally,as depicted in Figure2,the idea is to extract an explanation of how a small piece of data evolved,when one is not interested in the whole system.2.3Convergence of modelsIt is obvious that there has to be some convergence of these two models.In the case of a telephone number, the operation of interest is almost certainly copying and a“local”account of the source can be given.By contrast,in our workflow example,since we know nothing about the processing involved in M3“Expand SNP Set”,then each resulting SNP in the output set can only be understood to depend on all input SNPs and ethnicity information,rather than one particular SNP and ethnicity information.But most cases lie3Figure2:An informal view of data provenancesomewhere between these two extremes.First,even if we stay within the confines of relational query languages,we may be interested in how a record was formed.For example,the incorrect telephone number may have found its way into your medical record because that record was formed by the incorrect“join”of two other records.In this case you would not only be interested in where the telephone number was copied from,but also in how that record was constructed. To this end[GKT07]describes the provenance of a record by a simple algebraic term.This term can be thought of as a characterization of the“mini-workflow”that constructed the record.At the other end of the spectrum,workflow specification languages such as Taverna and Kepler do not simply treat their component programs and data sets as“black boxes”.They are capable of executing instructions such as applying a procedure to each member of a set or selecting from a set each member with a certain property(see,for example,[TMG+07]).These are similar to operations of the relational algebra,so within these workflow specifications it is often possible to extract somefiner grain provenance information. 
There is certainly hope[ABC+10]that further unification of models is possible,but whether there will ever be a single model that can be used for all kinds of provenance is debatable.There is already substantial variation in the models that have been developed for data provenance,and although there is a standardization effort for workflow provenance,it is not yet clear that there is one satisfactory provenance model for all varieties of workflow specification.3Capturing provenanceCreating the right models of provenance assumes that provenance is being captured to begin with.However, the current state of practice is largely manual population,i.e.humans must enter the information.We summarize some of the advances being made in this area:•Schema development:Several groups are reaching agreement on what provenance information should be recorded,whether the capture be manual or automatic.For example,within scientific workflow systems the Open Provenance Model(/),see also[MFF+08])has been developed to enable exchange between systems.Various government organizations are also defining the metadata that should be captured,e.g.PREMIS(Library of Congress),Upside(Navy Joint Air Missile Defense), and the Defense Discovery Metadata Specification(DDMS).•Automated capture:Several scientific workflow systems(e.g.Kepler,Taverna,and VisTrails)auto-matically capture processing steps as well as their input and output data.This“log data”is typically4stored as an XMLfile,or put in a relational database,to enable users to view provenance information.Other projects focus on capturing provenance at the operating system level,e.g.the Provenance Aware Storage System(PASS)project at Harvard.•Interception:Provenance information can also be“intercepted”at observation points in a system.For example,copy-paste operations could be intercepted so that where a piece of data was copied from is recorded.An experimental system of this kind has been proposed in[BCC06].Another strategy is to capture calls at multi-system“coordination points”(e.g.enterprise service buses).3.1Why capture is difficult and the evil of Ctrl-c Ctrl-vOne of the simplest forms of provenance to describe is that of manual copying of data.Much data,especially in curated databases is manually copied from one database to another.While describing this process is relatively simple and while it is a particularly important form of provenance to record,progress in realizing this has been slow.There are several reasons.First,the database into which a data element is being copied may have nofield in which to record provenance information;and providing,for each data value another value that describes provenance could double(or worse)the size of the database unless special techniques such as those suggested in[BCC06]are used. 
Second,the source database may have no accepted method of describing from where(from what location in the database)the data element was copied.This is related to the data citation problem.Third,if we are to assume that the provenance information is to be manually transferred or entered,we may be requiring too much of the database curators.Typically–and understandably–these people are more interested in extending the scope of their database and keeping it current than they are with the“scholarly record”.This last point illustrates that one of the major challenges is to change the mindset of people who manipulate and publish data and of the people who design the systems they use.For example,data items are often copied by a copy-paste operation,the unassuming Ctrl-c Ctrl-v keystrokes that are part of nearly every desktop environment.This is where much provenance gets lost.Altering the effect of this operation and the attitude of the people(all of us)who use it is probably the most important step to be made towards automatic recording of provenance.The foregoing illustrates that while,in many cases,our provenance models tell us what should be captured, our systems may require fundamental revision in order to do this.4Related topicsIn this section we briefly describe a number of topics concerning the use and quality of data that are closely related to provenance.4.1PrivacyAlthough capturing complete provenance information is desirable,making it available to all users may raise privacy concerns.For example,intermediate data within a workflow execution may contain sensitive infor-mation,such as the social security number,a medical record,orfinancial information about an individual;in our running example,the set of potential disorders may be confidential information.Although certain users performing an analysis with a workflow may be allowed to see such confidential data,making it available through a workflow repository,even for scientific purposes,is an unacceptable breach of privacy.Beyond5data privacy,a module itself may be proprietary,meaning that users should not be able to infer its behavior. 
While some systems may attempt to hide the meaning or behavior of a module by hiding its name and/or the source code,this does not work when provenance is revealed:Allowing the user to see the inputs and outputs of the module over a large number of executions reveals its behavior and violates module privacy (see[DKPR10,DKRCB10]for more details)..Finally,details of how certain modules in the workflow are connected may be proprietary,and therefore showing how data is passed between modules may reveal too much of the structure of the workflow;this can be thought of as structural privacy.There is therefore a tradeoffbetween the amount of provenance information that can be revealed and the privacy guarantees of the components involved.4.2Archiving,citation,and data currencyHere is a completely unscientific experiment that anyone can do.One of the most widely used sources of demographic data is the World Factbook2.Until recently it was published annually,but is now published on-line and updated more frequently.The following table contains for each of the last ten years,the Factbook estimate of the population of China(a10-digit number)and the number of hits reported by Google on that number at the time of writing(August2010)Year Population Google hits20101,338,612,96815,10020091,338,612,96815,10020081,330,044,5446,80020071,321,851,88815,70020061,313,973,7135,60020051,306,313,81220,70020041,298,847,6244,90020031,286,975,4683,28020021,284,303,7051,71020011,273,111,29032,70020001,261,832,482777The number of hits that a random10-digit number gets is usually less than100,so we have some reason to think that most of the pages contain an intended copy of a Factbookfigure for the population of China. Thefluctuation is difficult to explain:it may be to do with release dates of the Factbook;it may also be the result of highly propagated“shock”stories of population explosion.Does this mean that most Web references to the population of China are stale?Not at all.An article about the population of China written in2002should refer to the population of China in that year,not the population at the time someone is reading the article.However,a desultory examination of some of the pages that refer to oldfigures reveals that a substantial number of them are intended as“current”reference and should have been updated.What this does show up is the need for archiving evolving data sets.Not only should provenance for these figures be provided(it seldom is)but it should also be possible to verify that the provenance is correct.It is not at all clear who,if anyone,is responsible for archiving,and publishing archives of,an important resource like the Factbook.The problem is certainly aggravated by the fact that the Factbook is now continually updated rather than being published in annual releases.The story is the same for many curated data sets,even though tools have been developed for space-efficient archiving[BKTT04],few publishers of on-line data do a good job of keeping archives.What this means is that,even if people keep provenance information,the provenance trail can go dead.2The Central Intelligence Agency World Factbook https:///library/publications/the-world-factbook/6Along with archiving is the need to develop standards for data citation.There are well-developed standards–several of them–for traditional citations.The important observation is that citations carry more than simple provenance information about where the relevant information originates–it also carries useful information such as authorship,a brief 
description–a title–and other useful context information.It is often useful to store this along with provenance information or to treat it as part of provenance information.It is especially important to do this when,as we have just seen,the source data may no longer exist.5ConclusionProvenance is fundamental to understanding data quality.While models for copy-paste provenance[BCC06], database-style provenance[CCT09,BCTV08]and workflow provenance[DF08,FKSS08]are starting to emerge,there are a number of problems that require further understanding,including:operating across heterogeneous models of provenance;capturing provenance;compressing provenance;securing and verifying provenance;efficiently searching and querying provenance;reducing provenance overload;respecting privacy while revealing provenance;and provenance for evolving data sets.The study of provenance is also causing us to rethink established systems for information storage.In databases,provenance has shed new light on the semantics of updates;in ontologies it is having the even more profound effect of calling into question whether the three-column organization of RDF is adequate. We have briefly discussed models of provenance in this paper.While there is much further research needed in this area,it is already clear that the major challenges to capture of provenance are in engineering the next generation of programming environments and user interfaces and in changing the mind-set of the publishers of data to recognize the importance of provenance.References[ABC+10]Umut Acar,Peter Buneman,James Cheney,Jan Van den Bussche,Natalia Kwasnikowska,and Stijn Vansummeren.A graph model of data and workflow provenance.In Theory and Practiceof Provenance,2010.[BCC06]Peter Buneman,Adriane Chapman,and James Cheney.Provenance management in curated databases.In Proceedings of ACM SIGMOD International Conference on Management of Data,pages539–550,2006.[BCTV08]Peter Buneman,James Cheney,Wang-Chiew Tan,and Stijn Vansummeren.Curated databases.In PODS’08:Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposiumon Principles of database systems,pages1–12,New York,NY,USA,2008.ACM.[BKTT04]Peter Buneman,Sanjeev Khanna,Keishi Tajima,and Wang-Chiew Tan.Archiving Scientific Data.ACM Transactions on Database Systems,27(1):2–42,2004.[BT07]Peter Buneman and Wang Chiew Tan.Provenance in databases.In SIGMOD Conference, pages1171–1173,2007.[CCT09]James Cheney,Laura Chiticariu,and Wang-Chiew Tan.Provenance in databases:Why,how, and where.Foundations and Trends in Databases,1(9):379–474,2009.[DF08]Susan B.Davidson and Juliana Freire.Provenance and scientific workflows:challenges and opportunities.In SIGMOD Conference,pages1345–1350,2008.7[DKPR10]Susan B.Davidson,Sanjeev Khanna,Debmalya Panigrahi,and Sudeepa Roy.Preserving mod-ule privacy in workflow provenance.Manuscript available at /abs/1005.5543.,2010.[DKRCB10]Susan B.Davidson,Sanjeev Khanna,Sudeepa Roy,and Sarah Cohen-Boulakia.Privacy issues in scientific workflow provenance.In Proceedings of the1st International Workshop on WorkflowApproaches for New Data-Centric Science,June2010.[FKSS08]Juliana Freire,David Koop,Emanuele Santos,and Cl´a udio T.Silva.Provenance for computa-tional tasks:A puting in Science and Engineering,10(3):11–21,2008.[GKT07]Todd J.Green,Gregory Karvounarakis,and Val Tannen.Provenance semirings.In PODS, pages31–40,2007.[MFF+08]Luc Moreau,Juliana Freire,Joe Futrelle,Robert E.McGrath,Jim Myers,and Patrick Paulson.The open provenance model:An overview.In 
IPAW, pages 323–326, 2008.
[SPG05] Yogesh Simmhan, Beth Plale, and Dennis Gannon. A survey of data provenance in e-science. SIGMOD Record, 34(3):31–36, 2005.
[TMG+07] Daniele Turi, Paolo Missier, Carole A. Goble, David De Roure, and Tom Oinn. Taverna workflows: Syntax and semantics. In eScience, pages 441–448, 2007.
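To make the idea of fine-grain, copy-aware provenance discussed above more concrete, here is a minimal sketch in which every value carries a provenance record and copying extends that record rather than destroying it. This is only an illustration of the idea, not the system of [BCC06] or any other cited work; the source names, locations, and the population figure reused from the paper's Factbook table are illustrative.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Provenance:
    source: str                     # database or document the value came from
    location: str                   # where in that source (table/row/field, page, ...)
    operation: str                  # "entered", "copied", "computed", ...
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    derived_from: list = field(default_factory=list)   # earlier Provenance records

@dataclass
class Value:
    data: object
    provenance: Provenance

def copy_value(src: Value, target_source: str, target_location: str) -> Value:
    # Copying preserves the data and extends its provenance chain.
    return Value(
        data=src.data,
        provenance=Provenance(
            source=target_source,
            location=target_location,
            operation="copied",
            derived_from=[src.provenance],
        ),
    )

# A population figure entered into one database...
pop = Value(1_338_612_968,
            Provenance(source="world_factbook", location="China/population",
                       operation="entered"))

# ...and later copied into a curated database keeps a trail back to its origin.
copied = copy_value(pop, "my_curated_db", "countries/China/population")
print(copied.data, "came from", copied.provenance.derived_from[0].source)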
Basic Data Maintenance (English)
基础数据维护英语Basic Data MaintenanceData is the backbone of any organization, and its effective maintenance is crucial for ensuring the smooth operation and decision-making processes. Basic data maintenance encompasses a wide range of activities, from data collection and organization to data validation and updating. In this essay, we will explore the importance of basic data maintenance, the key principles, and the best practices for ensuring the integrity and reliability of an organization's data.Importance of Basic Data MaintenanceThe importance of basic data maintenance cannot be overstated. Accurate and up-to-date data is the foundation upon which an organization's operations, analyses, and strategic decisions are built. Without proper data maintenance, organizations face the risk of making decisions based on incomplete, inaccurate, or outdated information, which can lead to significant financial and operational consequences.Effective data maintenance helps organizations to:1. Ensure data accuracy and reliability: By regularly reviewing and updating data, organizations can minimize the risk of errors, inconsistencies, and inaccuracies, which can compromise the integrity of their data.2. Improve decision-making: With reliable and up-to-date data, organizations can make more informed and strategic decisions, leading to better outcomes and more effective resource allocation.3. Enhance operational efficiency: Maintaining accurate data can streamline various business processes, such as customer relations, supply chain management, and financial reporting, leading to increased productivity and cost savings.4. Comply with regulatory requirements: Many industries have specific data management and reporting requirements, and proper data maintenance is essential for ensuring compliance.5. Strengthen data security: Regularly reviewing and updating data can help organizations identify and address potential security vulnerabilities, reducing the risk of data breaches and other cyber threats.Key Principles of Basic Data MaintenanceEffective basic data maintenance is guided by several key principles:1. Data governance: Establishing a clear data governance framework, which defines roles, responsibilities, and policies for data management, is essential for ensuring the consistency and reliabilityof data across the organization.2. Data standardization: Implementing consistent data standards, formats, and conventions across the organization can greatly improve the accuracy and interoperability of data.3. Data validation: Regularly verifying the accuracy, completeness, and integrity of data through various validation techniques, such as data profiling, data cleansing, and deduplication, is crucial for maintaining data quality.4. Data backup and recovery: Implementing robust backup and disaster recovery strategies can help organizations safeguard their data and ensure business continuity in the event of data loss or system failures.5. Data security and access control: Establishing robust data security measures, such as access control, encryption, and logging, can help protect sensitive data from unauthorized access, tampering, or theft.Best Practices for Basic Data MaintenanceTo ensure the successful implementation of basic data maintenance, organizations should adopt the following best practices:1. 
Develop a comprehensive data maintenance plan: Create a detailed plan that outlines the data maintenance activities, responsibilities, timelines, and performance metrics to be used for monitoring and improving the data maintenance process.2. Automate data maintenance tasks: Leverage technology solutions,such as data management software and scripts, to automate repetitive data maintenance tasks, reducing the risk of human error and improving efficiency.3. Implement data quality monitoring: Regularly monitor data quality metrics, such as data completeness, accuracy, and timeliness, to identify and address any data quality issues in a timely manner.4. Foster a data-driven culture: Encourage a culture of data stewardship and accountability across the organization, where employees at all levels are responsible for maintaining the accuracy and integrity of the data they use or generate.5. Provide data maintenance training: Offer ongoing training and support to employees involved in data maintenance activities, ensuring they have the necessary skills and knowledge to perform their tasks effectively.6. Continuously review and improve: Regularly review the data maintenance process, identify areas for improvement, and implement changes to enhance the efficiency and effectiveness of the data maintenance program.ConclusionBasic data maintenance is a fundamental aspect of data management, and its importance cannot be overstated. By adhering to key principles and best practices, organizations can ensure the accuracy, reliability, and security of their data, enabling informed decision-making, improved operational efficiency, and enhancedcompliance with regulatory requirements. As the volume and complexity of data continue to grow, the need for effective basic data maintenance will only become more crucial for organizations seeking to thrive in an increasingly data-driven business landscape.。
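As a sketch of the "automate data maintenance tasks" and "data quality monitoring" practices above, the function below computes a few simple quality metrics (completeness, duplication, volume) with pandas. The metric choices, column names, and sample data are assumptions for illustration, not a prescribed standard.

import pandas as pd

def quality_report(df, key_columns, required_columns):
    # A few simple metrics a data maintenance plan might monitor over time.
    return {
        # Completeness: average share of non-missing values in required columns.
        "completeness": float(df[required_columns].notna().mean().mean()),
        # Duplication: share of rows whose key repeats an earlier row.
        "duplicate_rate": float(df.duplicated(subset=key_columns).mean()),
        # Volume: a crude signal for spotting missing or stalled data loads.
        "row_count": int(len(df)),
    }

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email":       ["a@example.com", None, "b@example.com", "c@example.com"],
    "country":     ["US", "DE", "DE", None],
})

print(quality_report(customers,
                     key_columns=["customer_id"],
                     required_columns=["email", "country"]))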
The Importance of Data (English Essay)
数据的重要性英语作文Data's ImportanceData has become an integral part of our daily lives, permeating every aspect of our society. From the way we communicate and make decisions to how we conduct business and govern, data has become the foundation upon which our modern world is built. The importance of data cannot be overstated, as it has the power to transform our lives, drive innovation, and shape the future.One of the most significant ways in which data has impacted our lives is in the realm of decision-making. In the past, decisions were often made based on intuition, experience, or limited information. However, with the advent of big data and advanced analytics, we now have access to vast amounts of information that can be used to make more informed and data-driven decisions. This has led to more effective problem-solving, better resource allocation, and more efficient processes across a wide range of industries.In the business world, data has become a crucial asset. Companies that are able to effectively collect, analyze, and leverage data are better equipped to understand their customers, identify new marketopportunities, and optimize their operations. By using data to make informed decisions, businesses can improve their bottom line, increase their competitiveness, and stay ahead of the curve in their respective industries.The impact of data extends far beyond the business world, however. In the realm of healthcare, for example, data is being used to develop more personalized and effective treatments, predict disease outbreaks, and improve patient outcomes. By analyzing large datasets, researchers and healthcare providers can identify patterns and trends that can inform medical decisions and lead to better patient care.Similarly, in the field of education, data is being used to personalize learning experiences, identify at-risk students, and improve educational outcomes. By analyzing student performance data, educators can tailor their teaching methods to the specific needs of their students, ensuring that every child has the opportunity to succeed.In the realm of public policy, data is also playing a crucial role. Governments and policymakers are using data to inform their decision-making, develop more effective policies, and better allocate resources. By analyzing data on issues such as crime, poverty, and public health, policymakers can identify and address the root causesof societal problems, leading to more effective and equitable solutions.Beyond these practical applications, data also has the potential to drive innovation and scientific discovery. By analyzing large datasets, researchers and innovators can uncover new insights, identify patterns, and develop groundbreaking solutions to some of the world's most pressing challenges. From climate change to space exploration, data is playing a crucial role in advancing our understanding of the world around us and shaping the future.Despite the many benefits of data, it is important to recognize that the use of data also comes with significant ethical and privacy concerns. As the amount of data we collect and store continues to grow, there is an increasing need to ensure that this data is being used responsibly and in a way that respects individual privacy and civil liberties. This requires the development of robust data governance frameworks, as well as ongoing dialogue and collaboration between policymakers, technology companies, and the public.In conclusion, the importance of data cannot be overstated. 
Data has the power to transform our lives, drive innovation, and shape the future. As we continue to navigate the rapidly evolving digital landscape, it is essential that we recognize the value of data andwork to ensure that it is being used in a way that benefits society as a whole. By embracing the power of data and addressing the ethical and privacy concerns that come with it, we can unlock the full potential of this invaluable resource and create a better, more informed, and more equitable world.。
Principles of data science: a response
principles of data science -回复[Principles of Data Science]Data science is an interdisciplinary field that combines statistical analysis, machine learning, and computer programming to extract insights and knowledge from data. It is rapidly growing in importance as organizations recognize the value of data-driven decision-making. In this article, we will explore the principles of data science and how they guide practitioners in the field.1. Data Collection: The first step in any data science project is to collect the necessary data. This can involve gathering data from various sources such as databases, web scraping, or through the use of APIs. It is important to ensure that the data collected is relevant, accurate, and unbiased to obtain meaningful results.2. Data Cleaning: Once the data has been collected, it often requires cleaning and preprocessing before it can be analyzed. This involves tasks such as removing duplicate entries, handling missing values, and standardizing the format of the data. Cleaning the data is crucial to improve the quality of analysis and to prevent inaccurate insights based on faulty or incomplete data.3. Exploratory Data Analysis: Exploratory data analysis (EDA) involves examining the data visually and statistically to gain an initial understanding of its properties. This phase helps identify patterns, correlations, and outliers within the data. EDA can be conducted using techniques such as data visualization, summary statistics, and hypothesis testing. It provides a foundation for further analysis and modeling.4. Feature Engineering: Feature engineering involves selecting and transforming the variables in the dataset to create meaningful predictors for the model. This process may include feature creation, scaling, dimensionality reduction, and encoding categorical variables. Careful feature engineering can significantly impact the performance and interpretability of the model.5. Model Building: Once the data has been processed, it is time to develop a model to predict or explain the outcome of interest. The choice of the appropriate model depends on the nature of the problem and the characteristics of the data. Some common models used in data science include linear regression, decision trees, support vector machines, and neural networks. It is important toevaluate the performance of the model using appropriate metrics and validation techniques to ensure its reliability and generalizability.6. Model Evaluation: Evaluating the model is crucial to assess its performance and identify areas for improvement. This can be done using metrics such as accuracy, precision, recall, and F1 score. Additionally, visualizing the model's performance through techniques like confusion matrices or ROC curves can provide deeper insights. The evaluation phase helps determine if the model is suitable for deployment or if further iterations and improvements are needed.7. Model Deployment: Once a satisfactory model has been developed and evaluated, it can be deployed for use in production systems. This may involve integrating the model into existing software or creating a stand-alone application. Ensuring scalability, maintenance, and monitoring of the deployed model is vital to sustain its accuracy and efficiency over time.8. Model Interpretability: In certain applications, understanding how the model makes predictions is essential, especially when itaffects human decision-making. 
Model interpretability techniques like feature importance, partial dependence plots, and SHAP values help explain the contribution of each predictor to the model's output. This transparency fosters trust in the model and facilitates decision-making based on its results.9. Continuous Learning: Data science is a dynamic field, and models may require updating as new data becomes available or as the problem context evolves. Continuous learning involves monitoring and updating models as necessary to ensure their continued accuracy and relevance. This iterative process allows data scientists to adapt and improve their models over time.10. Ethical Considerations: As with any powerful technology, data science raises ethical concerns. Data scientists must be aware of potential biases in the data, ensure privacy and data protection, and consider the ethical implications of their work. Adhering to ethical guidelines and regulations is crucial to ensure responsible and accountable use of data science.In conclusion, the principles of data science provide a systematic approach to extracting insights from data. From data collection tomodel deployment, each step requires careful consideration to generate meaningful and reliable results. By following these principles, data scientists can unlock the potential of data and make informed decisions that positively impact organizations and society.。
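A minimal end-to-end sketch of several of the steps described above (cleaning, feature scaling, model building, and evaluation) using scikit-learn; the synthetic dataset, the choice of logistic regression, and the 25% test split are illustrative assumptions rather than recommendations.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Synthetic data stands in for the data collection step.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out a test set so evaluation measures generalization, not memorization.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # data cleaning (a no-op here, shown for structure)
    ("scale", StandardScaler()),                 # feature engineering: standardization
    ("clf", LogisticRegression(max_iter=1000)),  # model building
])
model.fit(X_train, y_train)

# Model evaluation with standard metrics (precision, recall, F1, accuracy).
print(classification_report(y_test, model.predict(X_test)))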
The Disadvantages of Online Shopping (English Essay)
Online shopping, though incredibly convenient and popular, is not without its drawbacks. Here are some of the key disadvantages associated with this modern way of purchasing goods:
1. Lack of Physical Interaction: One of the most significant downsides of online shopping is the inability to physically inspect or try on the products before buying. This can lead to disappointment when the item does not meet expectations.
2. Limited Product Information: While online descriptions and images can provide some information, they may not always be comprehensive. Consumers may miss out on important details that would have been apparent in a physical store.
3. Risk of Fraud: Shopping online comes with the risk of encountering fraudulent websites or sellers. Scams and identity theft are genuine concerns that can lead to financial loss.
4. Delivery Delays and Costs: Online orders can be delayed due to various factors such as shipping logistics, customs, or simply high demand. Additionally, shipping costs can sometimes make the overall price of an item more expensive than it would be in a physical store.
5. Return and Exchange Hassles: Returning or exchanging items can be more complicated and time-consuming than in-store transactions. The process often involves additional shipping costs and waiting times.
6. Limited Customer Service: While online chat and email support are available, they may not be as immediate or personalized as face-to-face interactions with sales staff in a physical store.
7. Impulse Buying: The ease and convenience of online shopping can lead to impulse purchases, which can result in buyer's remorse or financial strain.
8. Environmental Impact: The packaging and shipping of online purchases contribute to waste and carbon emissions. The environmental footprint of online shopping is often higher than that of in-person shopping.
9. Data Privacy Concerns: To make purchases online, consumers must provide personal and financial information, which can be a concern for those worried about data privacy and security.
10. Impact on Local Economies: Online shopping can negatively affect local businesses and economies, as it may lead to the closure of brick-and-mortar stores.
11. Lack of Social Experience: Shopping in person can be a social activity, providing opportunities to interact with others and enjoy the shopping experience. Online shopping lacks this social aspect.
12. Dependence on Technology: Online shopping requires a reliable internet connection and access to a computer or smartphone. Those without such access may be excluded from the conveniences of online shopping.
Despite these drawbacks, many consumers continue to choose online shopping for its convenience, variety, and often lower prices. However, it's essential to be aware of these potential issues and take precautions to ensure a safe and satisfying online shopping experience.
Paying Attention to Digital Refugees (English Essay)
The world is currently facing a new challenge: the rise of digital refugees, or what some call "digital nomads". These are individuals who are forced to leave their homes and seek refuge in the digital world due to various reasons such as political instability, economic hardship, or simply the desire for a more flexible lifestyle.
In this digital age, where everything is interconnected and accessible with just a few clicks, it is no surprise that people are turning to the virtual world as a means of survival. The internet has become their new home, providing them with opportunities to work remotely, connect with others, and even build new communities.
However, being a digital refugee is not without its challenges. The lack of a physical address or legal status can make it difficult for these individuals to access basic services such as healthcare or education. They are often left to navigate the complexities of the online world on their own, facing issues such as cyberbullying, online scams, and identity theft.
Despite these challenges, being a digital refugee also has its advantages. The internet has opened up a world of possibilities for these individuals, allowing them to pursue their passions and talents on a global scale. They can now collaborate with others from different corners of the world, share their ideas and creations, and even monetize their skills through platforms such as freelancing websites or social media.
But perhaps the most significant impact of the rise of digital refugees is the blurring of boundaries and the breaking down of traditional societal norms. The internet has created a space where individuals can express themselves freely, without the constraints of their physical surroundings. It has given a voice to those who have been silenced, allowing them to share their stories and experiences with the world.
In conclusion, the rise of digital refugees is a phenomenon that cannot be ignored. It is reshaping the way we live, work, and connect with others. While it brings about both challenges and opportunities, it ultimately reflects the power of the digital world to empower individuals and transcend traditional boundaries. As we continue to navigate this new reality, it is important to ensure that no one is left behind and that everyone has equal access to the opportunities that the digital world offers.
ABSTRACT Provenance Management in Curated Databases
Provenance Management in Curated DatabasesABSTRACTCurated databases in bioinformatics and other disciplines are the result of a great deal of manual annotation,correc-tion and transfer of data from other sources.Provenance information concerning the creation,attribution,or version history of such data is crucial for assessing its integrity and scientific value.General purpose database systems provide little support for tracking provenance,especially when data moves among databases.This paper investigates general-purpose techniques for recording provenance for data that is copied among databases.We describe an approach in which we track the user’s actions while browsing source databases and copying data into a curated database,in order to record the user’s actions in a convenient,queryable form. We present an implementation of this technique and use it to evaluate the feasibility of database support for provenance management.Our experiments show that although the over-head of a na¨ıve approach is fairly high,it can be decreased to an acceptable level using simple optimizations.Categories and Subject DescriptorsH.2.8[Database Applications]:Scientific databasesGeneral TermsAlgorithms,Design,PerformanceKeywordsprovenance,annotations,scientific databases,curation 1.INTRODUCTIONModern science is becoming increasingly dependent on databases.This poses new challenges for database technol-ogy,many of them to do with scale and distributed process-ing[13].However there are other issues concerned with the preservation of the“scientific record”–how and from where Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on thefirst page.To copy otherwise,to republish,to post on servers or to redistribute to lists,requires prior specific permission and/or a fee.Copyright200X ACM X-XXXXX-XX-X/XX/XX...$rmation was obtained.These issues are particularly im-portant as database technology is employed not just to pro-vide access to source data,but also to the derived knowledge of scientists who have interpreted the data.Many scientists believe that provenance,or metadata describing creation, recording,ownership,processing,or version history,is es-sential for assessing the value of such data.However,prove-nance management is not well understood;there are few guidelines concerning what information should be retained and how it should be managed.Current database technol-ogy provides little assistance for managing provenance.In this paper we study the problem of tracking provenance of scientific data in curated databases,databases constructed by the“sweat of the brow”of scientists who manually assim-ilate information from several sources.First,it is important to understand the working practices and values of the sci-entists who maintain and use such databases.1.1Curated DatabasesThere are several hundred public-domain databases in the field of molecular biology[9].Few contain raw experimental data;most represent an investment of a substantial amount of effort by individuals who have organized,interpreted or re-interpreted,and annotated data from other sources.The Uniprot[24]consortium lists upwards of seventy scientists, variously called curators or annotators,whose job it is to add to or correct the reference databases published by the consortium.At the other end of the scale there are relatively small databases 
managed by a single individual,such as the Nuclear Protein Database[10].These databases are highly valued and have,in some cases,replaced paper publication as the medium of communication.Such databases are not confined to biology;they are also being developed in areas such as astronomy and geology.Reference manuals,dictio-naries and gazetteers that have recently moved from paper publication to electronic dissemination are also examples of curated databases.One of the characteristics of curated databases is that much of their content has been derived or copied from other sources,often other curated databases.Most curators be-lieve that additional record keeping is needed to record where the data comes from–its provenance.However,there are few established guidelines for what provenance information should be retained for curated databases,and little support is given by databases or surrounding technology for captur-ing provenance information.There has been some examina-tion[2,8,16,22]of provenance issues in data warehouses; that is,views of some underlying collection of data.But curated databases are not warehouses:they are manually(a)MyDBABC1O95477CRPSwissProt(b)MyDBABC1SwissProt−PTM(c)MyDBOMIMNCBI600046PublicationsPTMPTMABC1(d)Publications ABC1MyDB12504680P02741PubMed 123 6512NP_005493Figure 1:A biological database curation example.Dashed lines represent provenance links.constructed by highly skilled scientists,not computed auto-matically from existing data sets.1.1.1ExampleA molecular biologist is interested in how age and choles-terol efflux affect cholesterol levels and coronary artery dis-ease.She keeps a simple database of proteins which may play a role in these systems;this database could be any-thing from a flat text or XML file to a full RDBMS.One day,while browsing abstracts of recent publications,she dis-covers some interesting proteins on SwissProt,and copies the records from a SwissProt web page into her database (Figure 1(a)).She then (Figure 1(b))fixes the new entries so that the PTM (post-translational modification)found in SwissProt is not confused with PTMs in her database found from other sites.She also (Figure 1(c))copies some pub-lication details from Online Mendelian Inheritance in Man (OMIM)and some other related data from NCBI.Finally (Figure 1(d)),she notices a mistake in a PubMed publica-tion number and corrects it.This manual curation process is repeated many times as the researcher conducts her in-vestigation.One year later,when reviewing her information,she finds a discrepancy between two PTMs and the conditions under which they are found.Unfortunately,she cannot remember where the anomalous data came from,so cannot trace it to the source to resolve the conflict.Moreover,the databases from which the data was copied have changed;searching for the same data no longer gives the same results.The biolo-gist may have no choice but to discard all of the anomalous data or spend a few hours tracking down the correct values.This would be especially embarrassing if the researcher had already published an article or version of her database based on the now-suspect data.In some respects,the researcher was better offin the days of paper publication and record keeping,where there are well-defined standards for citation and some confidence that the cited data will not change.To recover these advantages for curated databases,it is necessary to retain provenance information describing the source and version history of the data.In Figure 1,this information is represented by the 
dashed lines connecting source data to copied data.The current approach to managing provenance in curated databases is for the database designer to augment the schema with fields to contain provenance data [1,18]and require curators to add and maintain the provenance information themselves.Such manual bookkeeping is time consuming and seldom performed.It should not be necessary.We be-lieve it is imperative to find ways of automating the process.1.2The problemThe term “provenance”has been used in a variety of senses in database and scientific computation research.One form of provenance is “workflow”or “coarse-grained”prove-nance:information describing how derived data has been calculated from raw observations [3,11,14,21].Workflow provenance is important in scientific computation,but is not a major concern in curated databases.Instead,we focus on “fine-grained”or “dataflow”provenance,which describes how data has moved through a network of databases.Specifically,we consider the problem of tracking and man-aging provenance describing the user actions involved in con-structing a curated database.This includes recording both local modifications to the database (inserting,deleting,and updating data)and global operations such as copying data from external sources.Because of the large number and va-riety of scientific databases,a realistic solution to this prob-lem is subject to several constraints.The databases are all maintained independently,so it is (in the short term)unre-alistic to expect all of them to adopt a standard for storing and exchanging provenance.A wide variety of data mod-els are in use,and databases have widely varying practices for identifying or locating data.While the databases are not actively uncooperative,they may change silently and past versions may not be archived.Curators employ a wide variety of application programs,computing platforms,etc.,including proprietary software that cannot be changed.In light of these considerations,we believe it is reason-able to restrict attention to a subproblem that is simple enough that some progress can be made,yet which we be-lieve provides benefits for a common realistic situation faced by database curators.Specifically,we will address the issue of how to track the provenance of data as it enters a curated database via inserts and copies,and how it changes as aFigure 2:Provenance architectureresult of local updates.1.3Our approachIn this paper,we propose and evaluate a practical ap-proach to provenance tracking for data copied manually among databases.In our approach,we assume that the user’s actions are captured as a sequence of insert,delete,copy,and paste actions by a provenance-aware application for browsing and editing databases.As the user copies,in-serts,or deletes data in her local database T ,provenance links are stored in an auxiliary provenance database P .These links relate data locations in T with locations in previous versions of T or in external source databases S .They can be used after the fact to review the process used to construct the data in T ;in addition,if T is also being archived,the provenance links can provide further detail about how each version of T relates to the next.The architecture is summarized in Figure 2.The new components are shaded,and the existing (and unchange-able)components are unshaded.The shaded triangles indi-cate wrappers mapping S 1,...,S n ,T to an XML view;the database P stores provenance information describing the up-dates performed by the editor.Alternatively,provenance in-formation 
could be stored as annotations alongside data in T ;however,this would require changing the structure of T .The only requirement we make is that there is a canonical location for every data element.We shall describe this in more detail shortly.When provenance information is tracked manually or by a custom-built system,the user or designer typically decides what provenance information to record on a case-by-case ba-sis.In contrast,our system records everything.The obvious concern is that the processing and storage costs for doing this could be unacceptably high.The main contribution of this paper is to show how such fine-grained provenance information can be tracked,stored,and queried efficiently.We have implemented our approach and experimented with a number of ways of storing and querying provenance information,including a na ¨ıve approach and several more sophisticated techniques.Our results demonstrate that the processing overhead of the na ¨ıve approach is fairly high;it can increase the time to process each update by 28%,and the amount of provenance information stored is proportional to the size of the changed data.In addition,we also inves-tigated the impact of two optimization techniques:transac-tional and hierarchical provenance management.Together,these optimizations typically reduce the added processing cost of provenance tracking to less than 5–10%per operation and reduce the storage cost by a factor of 5–7relative to the na ¨ıve approach;moreover,the storage overhead is bounded by the lesser of the number of update operations and the amount of data touched.In addition,typical provenance queries can be executed more efficiently on such provenance records.We believe that these results make a compelling argument for the feasibility of our approach to provenance management.1.4OutlineThe structure of the rest of this paper is as follows.Sec-tion 2presents the conceptual foundation of our approach to provenance tracking.Section 3presents the implemen-tation of CPDB,an instance of our approach that uses the Timber XML database [15].In Section 4,we present and analyze the experimental results.We discuss additional re-lated work,future work,and conclude in Sections 5–6.2.UPDATES AND PROVENANCEIn order to discuss provenance we need to be able to de-scribe where a piece of data comes from ;that is,we need to have a means for describing the location of any data el-ement.We make two assumptions about the data,which are already used in file synchronization [12]and database archiving [4]and appear to hold for a wide variety of scien-tific and other databases.The first is that the database can be viewed as a tree;the second is that the edges of that tree can be labeled in such a way that a given sequence of labels occurs on at most one path from the root and therefore iden-tifies at most one data element.Traditional hierarchical file systems are a well-known example of this kind of structure.Relational databases also can be described hierarchically.For instance,the data values in a relational database can be addressed using four-level paths where DB/R/tid/F ad-dresses the field value F in the tuple with identifier or key tid in table R of database DB .Scientific databases already use paths such as SwissProt/Release {20}/Q01780to identify a specific entry,and this can be concatenated with a path such as Citation {3}/Title to identify a data element.XML data can be addressed by adding key information [4].Note that this is a general assumption that is orthogonal to the data models in use by the 
various databases.Formally,we let Σbe a set of labels,and consider paths p ∈Σ∗as addresses of data in trees.The trees t we consider are unordered and store data values from some domain D only at the leaves.Such trees are written as {a 1:v 1,...,a n :v n },where v i is either a subtree or data value.We write t.p for the subtree of t rooted at location p .We next describe a basic update language that captures the user’s actions,and the semantics of such updates.The atomic update operations are of the form where U::=ins {a :v }into p |del a from p |copy q into pThe insert operation inserts an edge labeled a with value v into the subtree at p ;v can be either the empty tree or a data value.The delete operation deletes an edge and its subtree.The copy operation replaces the subtree at p with a copy of the subtree at location q .We write sequences of atomic updates as U 1;...;U n .We write [[U ]]for the function on trees induced by the update sequence U .The precisecopy S1/a2/y into T/c1/y;insert {c2:{}}into T;copy S1/a2into T/c2;insert {y :12}into T/c2;insert {c3:{}}into T;copy S1/a3into T/c3;copy S2/b3/y into T/c3;insert {c4:{}}into T;copy S2/b2into T/c4;insert {y :13}into T/c4;Figure 3:An example copy-paste update operation.a2xx yxy13527xyyFigure 4:An example of executing the update in Figure 3.The upper two trees S 1,S 2are XML views of source databases;the bottom trees T ,T are XML views of part of the target database at the beginning and end of the transaction.White nodes are nodes already in the target database;black nodes repre-sent inserted nodes;other shadings indicate whether the node came from S 1or S 2.Dashed lines indicate provenance links.Additional provenance links can be inferred from context.semantics of the operations is as follows.[[ins {a :v }into p ]](t )=t [p :=t.p {a :v }][[del a from p ]](t )=t [p :=t.p −a ][[copy q into p ]](t )=t [p :=t.q ][[U ;U ]](t )=[[U ]]([[U ]](t ))Here,t u denotes the tree t with subtree u added;this fails if there are any shared edge names in t and u ;t −a denotes the result of deleting the node labeled a ,failing if no such node exists;and t [p :=u ]denotes the result of replacing the subtree of t at p by u ,failing if path p is not present in t .Insertions,copies,and deletes can only be performed in a subtree of the target database T .As an example,consider the update operations in Fig-ure 3.These operations copy some records from S 1and S 2,then modify some of the field values.The result of executing this update operation on database T with source databases S 1,S 2is shown in Figure 4.The initial version of the target database is labeled T ,while the version after the transaction is labeled T .2.1Provenance trackingFigure 4depicts provenance links (dashed lines)that con-nect copied data in the target database with source data.Of course,these links are not visible in the actual result of the update.In our approach,these links are stored “on the side”in an auxiliary table Prov (T id,Op,T o,F rom ),where T id is a sequence number for the transaction that made the corresponding change;Op is one of I (insert),C (copy),or D (delete);F rom is the old location (for a copy or delete),and T o is the location of the new data (for an insert or copy).The T o fields of deletes and F rom fields of inserts are ignored;we assume their values are null (⊥).Additional information about the transaction,such as commit time and user identity can be stored in a separate table.We shall now describe several ways of storing provenance information.2.1.1Na¨ıve 
provenanceThe most straightforward method is to store one prove-nance record for each copied,inserted,or deleted node.In addition,each update operation is treated as a separate transaction.This technique may be wasteful in terms of space,because it introduces one provenance record for ev-ery node inserted,deleted,or copied throughout the update.However,it retains the maximum possible information about the user’s actions.In fact,the exact update operation de-scribing the user’s sequence of actions can be recovered from the provenance table.2.1.2Transactional provenanceThe second method is to assume the updated actions are grouped into transactions larger than a single operation,and to store only provenance links describing the net changes re-sulting from a transaction.For example,if the user copies data from S 1,then on further reflection deletes it and uses data from S 2instead,and finally commits,this has the same effect on provenance as if the user had only copied the data from S 2.Thus,details about intermediate states or tem-porary data storage in between consistent official database versions are not retained.Transactional provenance may be less precise than the na ¨ıve approach,because informa-tion about intermediate states of the database is discarded,but the user has control over what provenance information is retained,so can use shorter transactions as necessary to describe the exact construction process.The storage cost for the provenance of a transaction is the number of nodes touched in the input and output of the transaction.That is,the number of transactional prove-nance records produced by an update transaction t is i +d +c ,where i is the number of inserted nodes in the output,d is the number of nodes deleted from the input,and c is the number of copied nodes in the output.2.1.3Hierarchical provenanceWhether or not transactional provenance is used,much of the provenance information tends to be redundant (see Figure 5(a,b)),since in many cases the annotation of a child node can be inferred from its parent’s annotation.Accord-ingly,we consider a second technique,called hierarchical provenance .The key observation is that we do not need to store all of the provenance links explicitly,because the provenance of a child of a copied node can often be inferred from its parent’s provenance using a simple rule.Thus,in(a)ProvT id T o121T/c1/yI⊥123T/c2C S1/a2/x 124T/c2/yI⊥126T/c3C S1/a3/x 126T/c3/yC S2/b3/y 128T/c4C S2/b2 129T/c4/xI⊥(b)ProvT id T o121T/c1/yC S1/a2 121T/c2/xI⊥121T/c3C S1/a3/x 121T/c3/yC S2/b2 121T/c4/xI⊥(c)HProvT id T o121T/c1/yI⊥123T/c2I⊥125T/c3C S1/a3 127T/c3/yI⊥129T/c4I⊥(d)HProvT id T o121T/c1/yC S1/a2 121T/c2/yC S2/b3/y 121T/c4I⊥Figure5:The provenance tables for the update op-eration of Figure2.(a)One transaction per line.(b) Entire update as one transaction.(c)Hierarchical version of(a).(d)Hierarchical version of(b). hierarchical provenance we store only the provenance links that cannot be so inferred.These non-inferable links corre-spond to the provenance links shown in Figure4.Insertions and deletions are treated as for na¨ıve provenance,while a copy-paste operation copy p into q results in adding only a single record HProv(t,C,q,p).Figure5(c)shows the hier-archical provenance table HProv corresponding to the na¨ıve version of Prov.In this case,the reduced table is about 25%smaller than Prov,but much larger savings are possible when entire records or subtrees are copied with little change. 
Unlike transactional provenance,hereditary provenance does not lose any information and does not require any user interaction.We can define the full provenance table as a view of the hierarchical table as follows.If the provenance is specified in HProv,then it is just copied into Prov.Oth-erwise,the provenance of every target path p/a not men-tioned in HProv is q/a,provided p was copied from q.If p was inserted,then we assume that p/a was also inserted; that is,children of inserted nodes are assumed to also have been inserted,unless there is a record in HProv indicating otherwise.Formally,the full provenance table Prov can be defined in terms of HProv as the following recursive query: Infer(t,p)←¬(∃t,x,q.HProv(t,x,p,q))Prov(t,op,p,q)←HProv(t,op,p,q).Prov(t,C,p/a,q/a)←Prov(t,C,p,q),Infer(t,p).Prov(t,I,p/a,⊥)←Prov(t,I,p,⊥),Infer(t,p).We have to use an auxiliary table Infer to identify the nodes that have no explicit provenance in HProv,to ensure that only the provenance of the closest ancestor is used.In our implementation,Prov is calculated from HProv as necessary for paths in T,so this check is unnecessary.It is not difficult to show that an update sequence U can be described by a hierarchical provenance table with|U|entries.2.1.4Transactional-hierarchical provenance Finally,we considered the combination of transactional and hierarchical provenance techniques;there is little diffi-culty in combining them.Figure5(d)shows the transactional-hierarchical provenance of the transaction in Figure3.It is also easy to show that the storage of transactional-hierarchical provenance is i+d+C,where i and d are defined as in the discussion of transactional provenance and C is the number of roots of copied subtrees that appear in the output.This is bounded above by both|U|and i+d+c,so transactional-hierarchical provenance may be more concise than either approach alone.2.2Provenance queriesHow can we use the machinery developed in the previ-ous section to answer some practical questions about data? Consider some simple questions:Src What transactionfirst created a node?This is particu-larly useful in the case of leaf data;e.g.,who enteredyour telephone number incorrectly?Hist What is the sequence of all transactions that copied a node to its current position?Mod What transactions were responsible for the creation or modification of the subtree under a node?For exam-ple,you would like the complete history of some entryin a database.Hist and Mod provide very different information.A subtree may be copied many times without being modified.Wefirst define some convenient views of the raw Prov table(which,of course,may also be a view derived from HProv).We define the views Unch(t,p),Ins(t,p),Del(t,p), and Copy(t,p,q),which intuitively mean“p was unchanged, inserted,deleted,or copied from q during transaction t,”respectively.Unch(t,p)←¬(∃x,q.Prov(t,x,p,q)).Ins(t,p)←Prov(t,I,p,⊥)Del(t,p)←Prov(t,D,⊥,p)Copy(t,p,q)←Prov(t,C,p,q)We also consider a node p to“come from”q during transac-tion t(table From(t,p,q))provided it was either unchanged (and p=q)or p was copied from q.From(t,p,q)←Copy(t,p,q)From(t,p,p)←Unch(t,p)Next,we define a Trace(p,t,q,u),which says that the data at location p at the end of transaction t“came from”the data at location q at the end of transaction u.Trace(p,t,p,t).Trace(p,t,q,u)←Trace(p,t,r,s),Trace(r,s,q,u). 
Trace(p,t,q,t−1)←From(t,p,q).Note that Trace is essentially the reflexive,transitive closure of From.Now to define the queries mentioned at the begin-ning of the section,suppose that t now is the last transactionnumber in Prov,and defineSrc(p)={u|∃q.Trace(p,t now,q,u),Ins(u,q)}Hist(p)={u|∃q.Trace(p,t now,q,u),Copy(u,q)}Mod(p)={u|∃q.p≤q,Trace(q,t now,r,u),¬Unch(u,r)} That is,Src(p)returns the number of the transaction that inserted the node now at p,while Hist(p)returns all trans-action numbers that were involved in copying the data now at p.Finally,Mod(p)returns all transaction numbers that modified some data under p.This set could then be com-bined with additional information about transactions to iden-tify all users that modified the subtree at p.Here,p≤q means p is a prefix of q.Despite the fact that there may be infinitely many paths q extending p,the answer Mod(p) is stillfinite,since there are onlyfinitely many transaction identifiers in Prov.The point of this discussion is to show that provenance mappings relating a sequence of versions of a database can be used to answer a wide variety of queries about the evo-lution of the data,even without cooperation from source databases.However,if only the target database tracks prove-nance,the information is necessarily partial.For example, the Src query above cannot tell us anything about data that was copied from elsewhere.Similarly,the Hist and Mod queries stop following the chain of provenance of a piece of data when it exits T.If we do not assume that all the databases involved track provenance and publish it in a con-sistent form,many queries only have incomplete answers.Of course,if source databases also store provenance,we can provide more complete answers by combining the prove-nance information of all of the databases.In addition,there are queries which only make sense if several databases track provenance,such as:Own What is the history of“ownership”of a piece of data?That is,what sequence of databases contained the pre-vious copies of a node?It would be extremely useful to be able to provide answers to such queries to scientists who wish to evaluate the quality of data found in scientific databases.3.IMPLEMENTATIONWe have implemented a“copy-paste database”CPDB that tracks the provenance of data copied from external sources to the target database.In order to demonstrate the flexibility of our approach,our system connects several dif-ferent publically downloadable databases.We have chosen to use MiMI[18],a biological database of curated datasets, as our target database(T in Figure2).MiMI is a protein in-teraction database that runs on Timber[15],a native XML database.We used OrganelleDB[25],a database of protein localization information built on MySQL,as an example of a source database.Since the target database interacts with only one source database at a time,we only experimented with one source database.3.1OverviewCPDB permits the user to connect to the external databases, copy source data into the target database,and modify the data tofit the target database’s structure.The user’s ac-tions are intercepted and the resulting provenance informa-tion is recorded in a provenance store.Currently,CPDBReturns a tree,with unique identifiers,is responsible for determining how therelational database to tree format.Returns a list of nodes that a user hasthe list is size1.Otherwise,each nodecontained in the list.Each node containsInserts a new,empty node with (String nodename)according to the database’s mapping deleteNode()target 
database.Insert node X as a child of theto database schema mapping.Figure6:Wrappers for Source and Target Databasesprovides a minimal Web interface for testing purposes.Pro-viding a more user-friendly browsing/editing interface is im-portant,but orthogonal to the data management issues that are our primary concern.In order to allow the user to select pertinent information from the source and target databases,each database must be wrapped in a way that allows CPDB to extract the appropri-ate information.This wrapping is essentially the same as a “fully-keyed”XML view of the underlying data.In addition, the target database must also expose particular methods to allow for easy updating.Figure6describes the necessary functions that the source and target databases must imple-ment.Essentially,the source and target databases must provide methods that map tree paths to the database’s na-tive data;in addition,the target database must be able to translate updates to the tree to updates to its internal data. This approach does not require that any of the source or target databases represent data internally as XML.Any underlying data model for which path addresses make sense can be used.Also,the databases need not expose all of their data.Instead,it is up to the databases’administrators how much data to expose for copying or updating.In many cases, the data in scientific databases consists of a“catalog”rela-tion that contains all the raw data,together with supporting cross-reference tables.Typically,it is only this catalog that would need to be made available by a source database. 3.2Implementation of provenance tracking Given wrapped source and target databases,CPDB main-tains a provenance store that allows us to track any changes made to the target database incorporating data from the sources.To this end,during a copy-paste transaction,we write the data values to the target database,and write the provenance information to the provenance store.A user may specify any of the storage operations discussed in the previ-ous section.In this section,we discuss how the implemen-tations of provenance tracking and the Src,Hist,and Mod provenance queries differ from the idealized forms presented in Section2.。
The Importance of Data Privacy
The Importance of Data Privacy Data privacy is a crucial issue in today's digital age, where personal information is constantly being collected, stored, and shared by various entities. The importance of data privacy cannot be overstated, as it is essential for protecting individuals' rights and ensuring their safety in an increasingly interconnected world. From social media platforms to online shopping websites, our personal data is constantly being tracked and used for various purposes, often without our knowledge or consent. One of the primary reasons why data privacy is so important is because it is directly linked to individuals' autonomy and freedom. When our personal information is collected and used without our consent, it can lead to a loss of control over our own lives. For example, companies may use our data to target us with personalized advertisements or manipulate our behavior without us even realizing it. This can have serious implications for our decision-making processes and ultimately undermine our ability to make informed choices. Moreover, data privacy is also essential for protecting individuals' security and preventing identity theft and fraud. When sensitive information such as our financial details or social security numbers are exposed to unauthorized parties,it can result in devastating consequences for our personal and financial well-being. Hackers and cybercriminals are constantly on the lookout forvulnerabilities in systems to exploit and steal valuable data, putting individuals at risk of becoming victims of cybercrime. In addition to protecting individuals' rights and security, data privacy is also crucial for maintaining trust in the digital economy. When consumers feel that their personal information is being handled responsibly and ethically by companies, they are more likely to engagewith online services and make purchases. On the other hand, when data breaches and privacy violations occur, it can erode trust and lead to a loss of confidence in the digital ecosystem as a whole. This can have far-reaching consequences for businesses, as they may struggle to retain customers and attract new ones. From a regulatory perspective, data privacy laws and regulations play a critical role in safeguarding individuals' rights and holding organizations accountable for howthey handle personal information. In recent years, there has been a growing pushfor stronger data protection measures, such as the General Data ProtectionRegulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the United States. These laws aim to give individuals more control over their data and impose strict requirements on companies to ensure that they are transparent and accountable in their data practices. Overall, the importance of data privacy cannot be understated in today's digital landscape. It is essential for protecting individuals' autonomy, security, and trust in the digital economy. As technology continues to advance and data collection becomes more pervasive, it is crucialthat we prioritize data privacy and advocate for stronger protections to safeguard our personal information. By raising awareness about the importance of data privacy and advocating for stronger regulations, we can work towards creating a safer and more secure digital environment for all.。
雅思小作文数据重复
雅思小作文数据重复I don't have emotions, but I will strive to generate a response that meets the requirements provided.Data repetition is a common issue in the IELTS writing task, especially in the small essay section. This problem can occur due to limited vocabulary, lack of creativity, or a narrow range of ideas. It can affect the overall quality of the essay and may lead to a lower score. In this response, I will explore the causes and consequences of data repetition in IELTS small essays, as well as provide some strategies to overcome this issue.One of the main reasons for data repetition in IELTS small essays is the limited vocabulary of the candidates. Many test-takers rely on a small set of words and phrases, which leads to the repetition of the same ideas and concepts throughout the essay. This can make the essay sound monotonous and uninteresting, and it may fail to demonstrate the candidate's full language proficiency. Additionally, some candidates may struggle to find synonyms for commonly used words, which further contributes to data repetition.Another factor that contributes to data repetition in IELTS small essays is the lack of creativity. Some candidates may find it challenging to come up with new and original ideas for each essay topic. As a result, they may end up repeating the same arguments and examples in different essays, which can negatively impact the coherence and cohesion of their writing. This lack of creativity can also make the essay less engaging for the reader, leading to a lower score in the task response criterion.Furthermore, a narrow range of ideas can also lead to data repetition in IELTS small essays. Some candidates may have a limited understanding of the essay topics or may struggle to generate diverse perspectives on the given issue. As a result, they may end up repeating the same points and arguments throughout the essay,which can hinder the development of their ideas and the overall quality of their writing.The consequences of data repetition in IELTS small essays can be significant. Firstly, it can affect the coherence and cohesion of the essay, making itdifficult for the reader to follow the candidate's line of thought. This can lead to a lower score in the coherence and cohesion criterion, as the essay may lack logical progression and organization. Additionally, data repetition can make the essay less engaging and persuasive, which can impact the candidate's score in the task response criterion. Overall, data repetition can hinder the candidate'sability to effectively communicate their ideas and arguments, which can result in a lower score in the writing task.To overcome the issue of data repetition in IELTS small essays, candidates can employ several strategies. Firstly, they can work on expanding their vocabulary by learning new words and phrases and practicing their use in different contexts. This can help them avoid using the same words and ideas repeatedly and make their writing more varied and engaging. Additionally, candidates can focus on developing their creativity by brainstorming multiple perspectives on the given essay topic and exploring different examples and arguments. This can help them generate original ideas and avoid data repetition in their writing.Furthermore, candidates can improve their understanding of essay topics by reading widely and staying informed about current issues. This can help them develop a broader range of ideas and perspectives, which can enrich their writing and prevent data repetition. 
Additionally, candidates can practice writing essays on a variety of topics to hone their skills in generating diverse arguments and examples. This can help them become more confident in their ability to produce original and engaging essays for the IELTS writing task.In conclusion, data repetition is a common issue in IELTS small essays, and it can stem from a limited vocabulary, lack of creativity, and a narrow range of ideas. This problem can impact the overall quality of the essay and may lead to alower score in the writing task. However, candidates can overcome this issue by expanding their vocabulary, developing their creativity, and broadening their understanding of essay topics. By employing these strategies, candidates can produce more varied and engaging essays that effectively communicate their ideas and arguments, leading to a higher score in the IELTS writing task.。
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
Data Provenance:Some Basic IssuesPeter Buneman,Sanjeev Khanna,and Wang-Chiew TanUniversity of PennsylvaniaAbstract.The ease with which one can copy and transform data onthe Web,has made it increasingly difficult to determine the origins of apiece of data.We use the term data provenance to refer to the processof tracing and recording the origins of data and its movement betweendatabases.Provenance is now an acute issue in scientific databases whereit is central to the validation of data.In this paper we discuss some ofthe technical issues that have emerged in an initial exploration of thetopic.1IntroductionWhen youfind some data on the Web,do you have any information about how it got there?It is quite possible that it was copied from somewhere else on the Web,which,in turn may have also been copied;and in this process it may well have been transformed and edited.Of course,when we are looking for a best buy,a news story,or a movie rating,we know that what we are getting may be inaccurate,and we have learned not to put too much faith in what we extract from the Web.However,if you are a scientist,or any kind of scholar,you would like to have confidence in the accuracy and timeliness of the data that you are working with.In particular,you would like to know how it got there.In its brief existence,the Web has completely changed the way in which data is circulated.We have moved very rapidly from a world of paper documents to a world of on-line documents and databases.In particular,this is having a profound effect on how scientific research is conducted.Let us list some aspects of this transformation:–A paper document is essentially unmodifiable.To“change”it one issues a new edition,and this is a costly and slow process.On-line documents,by contrast,can be(and often are)frequently updated.–On-line documents are often databases,which means that they have explicit structure.The development of XML has blurred the distinction between documents and databases.–On-line documents/databases typically contain data extracted from other documents/databases through the use of query languages or“screen-scrap-ers”.Among the sciences,thefield of Molecular Biology is possibly one of the most sophisticated consumers of modern database technology and has generated S.Kapoor and S.Prasad(Eds.):FST TCS2000,LNCS1974,pp.87–93,2000.c Springer-Verlag Berlin Heidelberg200088Peter Buneman,Sanjeev Khanna,and Wang-Chiew Tana wealth of new database issues[15].A substantial fraction of research in ge-netics is conducted in “dry”laboratories using in silico experiments –analysis of data in the available databases.Figure 1shows how data flows through a very small fraction of the available molecular biology databases 1.In all but one case,there is a Lit –for literature –input to a database indicating that this is database is curated .The database is not simply obtained by a database query or by on-line submission,but involves human intervention in the form of addi-tional classification,annotation and error correction.An interesting property of this flow diagram is that there is a cycle in it.This does not mean that there is perpetual loop of possibly inaccurate data flowing through the system (though this might happen);it means that the two databases overlap in some area and borrow on the expertise of their respective curators.The point is that it may now be very difficult to determine where a specific piece of data comes from.We use the term data provenance broadly to refer to a description of the origins of a piece of data and 
the process by which it arrived in a database.Most im-plementors and curators of scientific databases would like to record provenance,but current database technology does not provide much help in this process for databases are typically rather rigid structures and do not allow the kinds of ad hoc annotations that are often needed for recording provenance.Fig.1.The Flow of Data in BioinformaticsThe databases used in molecular biology form just one example of why data provenance is an important issue.There are other areas in which it is equally acute [5].It is an issue that is certainly broader than computer science,with legal 1Thanks to Susan Davidson,Fidel Salas and Chris Stoeckert of the Bioinformatics Center at Penn for providing this information.Data Provenance:Some Basic Issues89 and ethical aspects.The question that computer scientists,especially theoretical computer scientists,may want to ask is what are the technical issues involved in the study of data provenance.As in most areas of computer science,the hard part is to formulate the problem in a concise and applicable fashion.Once that is done,it often happens that interesting technical problems emerge.This abstract reviews some of the technical issues that have emerged in an initial exploration.2Computing Provenance:Query InversionPerhaps the only area of data provenance to receive any substantial attention is that of provenance of data obtained via query operations on some input databases.Even in this restricted setting,a formalization of the notion of data provenance turns out to be a challenging problem.Specifically,given a tuple t in the output of a database query Q applied on some source data D,we want to understand which tuples in D contributed to the output tuple t,and if there is a compact mechanism for identifying these input tuples.A natural approach is to generate a new query Q ,determined by Q,D and t,such that when the query Q is applied to D,it generates a collection of input tuples that“contributed to”the output tuple t.In other words,we would like to identify the provenance by inverting the original query.Of course,we have to ask what we mean by con-tributed to?This problem has been studied under various names including“data pedigree”and“data lineage”in[1,9,7].One way we might answer this question is to say that a tuple in the input database“contributes to”an output tuple if changing the input tuple causes the output tuple to change or to disappear from the output.This definition breaks down on the simplest queries(a projection or union).A better approach is to use a simple proof-theoretic definition.If we are dealing with queries that are expressible in positive relational algebra(SPJU) or more generally in positive datalog,we can say that an input tuple(a fact)“contributes to”an output tuple if it is used in some minimal derivation of that tuple.This simple definition works well,and has the expected properties:it is invariant under query rewriting,and it is compositional in the expected way. 
Unfortunately,these desirable properties break down in the presence of negation or any form of aggregation.To see this consider a simple SQL query: SELECT name,telephoneFROM employeeWHERE salary>SELECT AVERAGE salary FROM employeeHere,modifying any tuple in the employee relation could affect the presence of any given output tuple.Indeed,for this query,the definition of“contributes to”given in[9]makes the whole of the employee relation contribute to each tuple in the output.While this is a perfectly reasonable definition,the properties of invariance under query rewriting and compositionality break down,indicating that a more sophisticated definition may be needed.Before going further it is worth remarking that this characterization of prove-nance is related to the topics of truth maintenance[10]and view maintenance90Peter Buneman,Sanjeev Khanna,and Wang-Chiew Tan[12].The problem in view maintenance is as follows.Suppose a database(a view) is generated by an expensive query on some other database.When the source database changes,we would like to recompute the view without recomputing the whole query.Truth maintenance is the same problem in the terminology of deductive systems.What may make query inversion simpler is that we are only interested in what is in the database;we are not interested in updates that would add tuples to the database.In[7]another notion of provenance is introduced.Consider the SQL query above,and suppose we see the tuple("John Doe",12345)in the output.What the previous discussion tells us is why that tuple is in the output.However,we might ask an apparently simpler question:given that the tuple appears in the output,where does the telephone number12345come from?The answer to this seems easy–from the"John Doe"tuple in the input.This seems to imply that as long as there is some means of identifying tuples in the employee relation, one can compute where-provenance by tracing the variable(that emits12345) of the query.However,this intuition is fragile and a general characterization is not obvious;it is discussed in[7].We remark that this second form of provenance,where-provenance,is also related to the view update problem[3]:if John Doe decides to change his tele-phone number at the view,which data should be modified in the employee relation?Again,where-provenance seems simpler because we are only interested in modifications to the existing view;we are not interested in insertions to the view.Another issue in query inversion is to capture other query languages and other data models.For example,we would like to describe the problem in object-oriented[11]or semistructured data models[2](XML).What makes these models interesting is that we are no longer operating at thefixed level of tuples in the relational model.We may want to ask for the why-or where-provenance of some deeply nested component of some structure.To this end,[7]studies the issue of data provenance in a“deterministic”model of semistructured data in which every element has a canonical path or identifier.Work on view maintainence based on this model has also been studied in[14].This leads us to our next topics,those of citing and archiving data.3Data CitationA digital library is typically a large and heterogeneous collection of on-line docu-ments and databases with sophisticated software for exploring the collection[13]. 
However many digital libraries are also being organized so that they serve as scholarly resources.This being the case,how do we cite a component of a digital library.Surprisingly,this topic has received very little attention.There appear to be no generally useful standards for citations.Well organized databases are constructed with keys that allow us uniquely to identify a tuple in a relation. By giving the attribute name we can identify a component of a tuple,so there is usually a canonical path to any component of the database.Data Provenance:Some Basic Issues91 How we cite portions of documents,especially XML documents is not soclear.A URL provides us with a universal locator for a document,but howare we to proceed once we are inside the document?Page numbers and line numbers–if they exist–are friable,and we have to remember that an XMLdocument may now represent a database for which the linear document structure is irrelevant.There are some initial notions of keys in the XML standard[4]and in the XML Schema proposals[16].In the XML Document Type Descriptor(DTD)one can declare an ID attribute.Values for this attribute are to be unique in the document and can be used to locate elements of the document.Howeverthe ID attribute has nothing to do with the structure of the document–it issimply a user-defined identifier.In XML-Schema the definition of a key relies on XPath[8],a path descriptionlanguage for XML.Roughly speaking a key consists of two paths through the data.Thefirst is a path,for example Department/Employee,that describes theset of nodes upon which a key constraint is to be imposed.This is called the target set.The second is another path,for example IdCard/Number that uniquelyidentifies nodes in the target set.This second part is called the key path,andthe rule is that two distinct nodes in the target set must have different values at the end of their key paths.Apart from some details and the fact that XPathis probably too complex a language for key specification,this definition is quiteserviceable,but it does not take into account the hierarchical structure of keys that are common in well-organized databases and documents.To give an example of what is needed,consider the problem of citing a part of a bible,organized by chapter,book and verse.We might start withthe idea that books in the bible are keyed by name,so we use the pair of paths (Bible/Book,Name).We are assuming here that Bible is the unique root.Now we may want to indicate that chapters are specified by number,but it wouldbe incorrect to write(Bible/Book/Chapter,Number)because this says that that chapter numbers are unique within the bible.Instead we need to specify a relative key which consists of a triple,(Bible/Book,Chapter,Number).What this means is that the(Chapter,Number)key is to hold at every node specified by by the path Bible/Book.A more detailed description of relative keys is given in[6].While some basic inference results are known,there is a litany of open questions surrounding them:What are appropriate path languages for the various components of a key?What inference results can be established for these languages?How do we specify foreign keys,and what results hold for them?What interactions are there between keys and DTDs.These are practical questions that will need to be answered if,as we do in databases,use keys as the basis for indexing and query optimization.92Peter Buneman,Sanjeev Khanna,and Wang-Chiew Tan4Archiving and Other Problems Associated with ProvenanceLet us suppose that we have a good 
formulation,or even a standard,for data citation,and that document A cites a(component of a)document B.Whose responsibility is it to maintain the integrity of B?The owner of B may wish to update it,thereby invalidating the citation in A.This is a serious problem in scientific databases,and what is commonly done is to release successive versions of a database as separate documents.Since one version is–more or less–an extension the previous version,this is wasteful of space and the space overhead limits the rate at which one can release versions.Also,it is difficult when the history of a database is kept in this form to trace the history of components of the database as defined by the key structure.There are a number of open questions:–Can we compress versions so that the history of A can be efficiently recorded?–Should keeping the cited data be the responsibility of A rather than B?–Should Bfigure out what is being cited and keep only those portions?In this context it is worth noting that,when we cite a URL,we hardly ever give a date for the citation.If we did this,at least the person who follows the citation will know whether to question the validity of the citation by comparing it with the timestamp on the URL.Again,let us suppose that we have an agreed standard for citations and that,rather than computing provenance by query inversion(which is only possi-ble when the data of interest is created by a query,)we decide to annotate each element in the database with one or more citations that describes its provenance. What is the space overhead for doing this?Given that the citations have struc-ture and that the structure of the data will,in part,be related to the structure of the data,one assumes that some form of compression is possible.Finally,one is tempted to speculate that we may need a completely different model of data exchange and databases to characterize and to capture provenance. 
One could imagine that data is exchanged in packages that are“self aware”2and somehow contain a complete history of how they moved through the system of databases,of how they were constructed,and of how they were changed.The idea is obviously appealing,but whether it can be formulated clearly,let alone be implemented,is an open question.References[1] A.Woodruffand M.Stonebraker.Supportingfine-grained data lineage in adatabase visualization environment.In ICDE,pages91–102,1997.[2]Serge Abiteboul,Peter Buneman,and Dan Suciu.Data on the Web.From Rela-tions to Semistructured Data and XML.Morgan Kaufman,2000.2A term suggested by David MaierData Provenance:Some Basic Issues93 [3]T.Barsalou,N.Siambela,A.Keller,and G Wiederhold.Updating relationaldatabases through object-based views.In Proceedings ACM SIGMOD,May1991.[4]Tim Bray,Jean Paoli,and C.M.Sperberg-McQueen.Extensible MarkupLanguage(XML) 1.0.World Wide Web Consortium(W3C),Feb1998./TR/REC-xml.[5]P.Buneman,S.Davidson,M.Liberman,C.Overton,and V.Tannen.Data prove-nance./∼wctan/DataProvenance/precis/index.html.[6]Peter Buneman,Susan Davidson,Carmem Hara,Wenfei Fan,and Wang-ChiewTan.Keys for XML.Technical report,University of Pennsylvania,2000..[7]Peter Buneman,Sanjeev Khanna,and Wang-Chiew Tan.Why and Where:ACharacterization of Data Provenance.In International Conference on Database Theory,2001.To appear,available at .[8]James Clark and Steve DeRose.XML Path Language(XPath).W3C WorkingDraft,November1999./TR/xpath.[9]Y.Cui and J.Widom.Practical lineage tracing in data warehouses.In ICDE,pages367–378,2000.[10]Jon Doyle.A truth maintenance system.Artificial Intelligence,12:231–272,1979.[11]R.G.G.Cattell et al,editor.The Object Database Standard:Odmg2.0.MorganKaufmann,1997.[12] A.Gupta and I.Mumick.Maintenance of materialized views:Problems,tech-niques,and applications.IEEE Data Engineering Bulletin,Vol.18,No.2,June 1995.,1995.[13]Michael Lesk.Practical Digital Libraries:Books,Bytes and Bucks,.MorganKaufmann,July1997.[14]Hartmut Liefke and Susan Davidson.View maintenance for hierarchical semistruc-tured data.In International Conference on Data Warehousing and Knowledge Discovery,2000.[15]Susan Davidson and Chris Overton and Peter Buneman.Challenges in IntegratingBiological Data Sources.Journal of Computational Biology,2(4):557–572,Winter 1995.[16]World Wide Web Consortium(W3C).XML Schema Part0:Primer,2000./TR/xmlschema-0/.。