Different Aspects of Web Log Mining


Web Usage Mining Using Artificial Ant Colony Clustering and Linear Genetic Programming

Ajith Abraham, Department of Computer Science, Oklahoma State University, Tulsa, OK 74106, USA (aa@)
Vitorino Ramos, CVRM-GeoSystems Centre, Technical University of Lisbon, Portugal (vitorino.ramos@alfa.ist.utl.pt)

Abstract - The rapid growth of e-commerce has confronted both the business community and customers with a new situation. Owing to intense competition on the one hand and the customer's option to choose from several alternatives on the other, the business community has realized the necessity of intelligent marketing strategies and relationship management. Web usage mining attempts to discover useful knowledge from the secondary data obtained from the interactions of users with the Web. It has become very critical for effective Web site management, creating adaptive Web sites, business and support services, personalization, network traffic flow analysis and so on. The study of ant colonies' behaviour and their self-organizing capabilities is of interest to knowledge retrieval/management and decision support systems, because it provides models of distributed adaptive organization which are useful for solving difficult optimization, classification, and distributed control problems, among others [16][17][18]. In this paper, we propose an ant clustering algorithm to discover Web usage patterns (data clusters) and a linear genetic programming approach to analyze the visitor trends. Empirical results clearly show that ant colony clustering performs well when compared to a self-organizing map (for clustering Web usage patterns), even though its accuracy does not match that of the evolutionary-fuzzy clustering (i-Miner) approach [1].

1 Introduction

The WWW continues to grow at an amazing rate as an information gateway and as a medium for conducting business. Web mining is the extraction of interesting and useful knowledge and implicit information from artefacts or activity related to the WWW [12][7]. Web servers record and accumulate data about user interactions whenever requests for resources are received. Analyzing the Web access logs can help understand user behaviour and the Web structure. From the business and applications point of view, knowledge obtained from Web usage patterns could be directly applied to efficiently manage activities related to e-business, e-services, e-education and so on. Accurate Web usage information could help to attract new customers, retain current customers, improve cross marketing/sales and the effectiveness of promotional campaigns, track leaving customers, and find the most effective logical structure for the Web space. User profiles could be built by combining users' navigation paths with other data features, such as page viewing time, hyperlink structure, and page content [9]. What makes the discovered knowledge interesting has been addressed by several works: results previously known are very often considered not interesting, so the key to making the discovered knowledge interesting is its novelty or unexpectedness.

There are several commercial software packages that provide Web usage statistics. These statistics can be useful for Web administrators to get a sense of the actual load on the server. For small Web servers, the usage statistics provided by conventional Web site trackers may be adequate to analyze usage patterns and trends.
However, as the size and complexity of the data increase, the statistics provided by existing Web log file analysis tools may prove inadequate, and more intelligent mining techniques will be necessary [10].

A generic Web usage mining framework is depicted in Figure 1. In the case of Web mining, data could be collected at the server level, client level, proxy level, or from some consolidated source. These data differ in terms of content, the way they are collected, and so on. The usage data collected at different sources represent the navigation patterns of different segments of the overall Web traffic, ranging from single-user, single-site browsing behaviour to multi-user, multi-site access patterns. As evident from Figure 1, the Web server log does not contain sufficiently accurate information for inferring the behaviour at the client side as it relates to the pages served by the Web server. Pre-processed and cleaned data could be used for pattern discovery, pattern analysis, Web usage statistics, and generating association/sequential rules. Much work has been performed on extracting various pattern information from Web logs, and the applications of the discovered knowledge range from improving the design and structure of a Web site to enabling business organizations to function more efficiently. Jespersen et al. [10] proposed a hybrid approach for analyzing visitor click sequences: a combination of hypertext probabilistic grammar and a click fact table is used to mine Web logs, which could also be used for general sequence mining tasks. Mobasher et al. [14] proposed a Web personalization system consisting of offline tasks related to the mining of usage data and an online process of automatic Web page customization based on the knowledge discovered. LOGSOM, proposed by Smith et al. [19], utilizes a self-organizing map to organize Web pages into a two-dimensional map based solely on the users' navigation behaviour rather than the content of the Web pages. LumberJack, proposed by Chi et al. [6], builds up user profiles by combining both user session clustering and traditional statistical traffic analysis using the K-means algorithm.

Figure 1. Web usage mining framework (data collection, pre-processing, pattern discovery, and sequential/association rule generation).

Joshi et al. [11] used a relational online analytical processing approach for creating a Web log warehouse using access logs and mined logs (association rules and clusters). A comprehensive overview of Web usage mining research is found in [7][20].

In this paper, an ant colony clustering algorithm (ACLUSTER) [16] is proposed to segregate visitors, and thereafter a linear genetic programming approach [3] is used to analyze the visitor trends. The results are compared with earlier works using a self-organizing map [21] and an evolutionary fuzzy c-means algorithm [1] to segregate the user access records, and several soft computing paradigms to analyze the user access trends.

Figure 2. University's daily Web traffic pattern for 5 weeks [15].

Figure 3. Average hourly Web traffic patterns for 5 weeks [15].

Web access log data from the Monash University Web site [15] were used for the experiments. The University's central Web server receives over 7 million hits in a week, and therefore it is a real challenge to find and extract hidden usage pattern information. Average daily and hourly access patterns for 5 weeks (11 August 2002 - 14 September 2002) are shown in Figures 2 and 3 respectively.
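Aggregates like the ones plotted in Figures 2 and 3 are produced in the pre-processing block of Figure 1 from the raw access logs. The Python sketch below is only an illustration added here (it is not part of the original study): it assumes Common Log Format records and simply counts requests per day of week and per hour of day.

```python
import re
from collections import Counter
from datetime import datetime

# Common Log Format, e.g.
# 127.0.0.1 - - [11/Aug/2002:13:55:36 +1000] "GET /index.html HTTP/1.0" 200 2326
CLF = re.compile(r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3}) \S+')

def traffic_profiles(log_lines):
    """Count requests per day of week and per hour of day."""
    daily, hourly = Counter(), Counter()
    for line in log_lines:
        m = CLF.match(line)
        if not m:
            continue  # skip malformed records
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        daily[ts.strftime("%A")] += 1   # e.g. 'Monday'
        hourly[ts.hour] += 1            # 0..23
    return daily, hourly

if __name__ == "__main__":
    sample = ['127.0.0.1 - - [11/Aug/2002:13:55:36 +1000] "GET /index.html HTTP/1.0" 200 2326']
    print(traffic_profiles(sample))
```

In practice the same pass would also filter out requests for images and crawler traffic and group requests into user sessions before any pattern discovery is attempted.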
Although the average daily and hourly patterns tend to follow a similar trend (as evident from the figures), the differences tend to increase during high-traffic days (Monday - Friday) and during the peak hours (11:00-17:00). Due to the enormous traffic volume and chaotic access behaviour, the prediction of user access patterns becomes more difficult and complex.

In the subsequent sections, we present the proposed architecture and experimentation results of the ant clustering - linear genetic programming approach. Some conclusions are provided towards the end.

2. Hybrid Framework Using Ant Colony Clustering and Linear Genetic Programming Approach (ANT-LGP)

The hybrid framework uses an ant colony optimization algorithm to cluster Web usage patterns. The raw data from the log files are cleaned and pre-processed, and the ACLUSTER algorithm [16] is used to identify the usage patterns (data clusters). The developed clusters of data are fed to a linear genetic programming model to analyze the usage trends.

2.1 Ant Colony Clustering Using Bio-Inspired Spatial Probabilities

In several species of ants, workers have been reported to sort their larvae or form piles of corpses - literally cemeteries - to clean up their nests. Chrétien [4] performed experiments with the ant Lasius niger to study the organization of cemeteries. Other experiments include the ant Pheidole pallidula, reported by Deneubourg et al. [8]. In nature, many species actually organize a cemetery. If corpses, or more precisely sufficiently large parts of corpses, are randomly distributed in space at the beginning of the experiment, the workers form cemetery clusters within a few hours, following a behaviour similar to aggregation. If the experimental arena is not sufficiently large, or if it contains spatial heterogeneities, the clusters will be formed along the edges of the arena or, more generally, following the heterogeneities. The basic mechanism underlying this type of aggregation phenomenon is an attraction between dead items mediated by the ant workers: small clusters of items grow by attracting workers to deposit more items. It is this positive and auto-catalytic feedback that leads to the formation of larger and larger clusters. In this case, it is therefore the distribution of the clusters in the environment that plays the role of the stigmergic variable. Deneubourg et al. [8] proposed a basic model (BM) to account for the above-mentioned phenomenon of corpse clustering in ants. The general idea is that isolated items should be picked up and dropped at some other location where more items of that type are present. Lumer and Faieta's (LF) model [13] generalized Deneubourg et al.'s basic method [8] and applied it to exploratory data analysis.

Instead of trying to resolve some disparities in the basic LF algorithm by adding different ant castes, short-term memories and behavioural switches, which are computationally intensive and pose a difficult, complex parameter-tuning problem, Ramos et al. [16] proposed the ACLUSTER algorithm to follow real ant-like behaviours as closely as possible. In that sense, bio-inspired spatial transition probabilities are incorporated into the system, avoiding randomly moving agents, which encourage the distributed algorithm to explore regions manifestly without interest (e.g., regions without any object clusters); this type of exploration is generally counterproductive and time consuming.
Since this type of transition probability depends on the spatial distribution of pheromone across the environment, the behaviour reproduced is also a stigmergic one. Moreover, the strategy not only allows guiding ants to find clusters of objects in an adaptive way (if, for any reason, one cluster disappears, pheromone tends to evaporate at that location), it also avoids the use of embodied short-term memories (since these transition probabilities also tend to increase pheromone in specific locations where more objects are present). As we shall see, the distribution of the pheromone represents the memory of the recent history of the swarm, and in a sense it contains information which the individual ants are unable to hold or transmit. There is no direct communication between the organisms but a type of indirect communication through the pheromonal field. In fact, ants are not allowed to have any memory, and the individual's spatial knowledge is restricted to local information about the whole colony's pheromone density.

In order to model the behaviour of ants associated with different tasks, such as dropping and picking up objects, we suggest the use of combinations of different response thresholds. As we have seen before, there are two major factors that should influence any local action taken by the ant-like agent: the number of objects in its neighbourhood, and their similarity (including the hypothetical object carried by the ant). Lumer and Faieta [13] use an average similarity, mixing distances between objects with their number and incorporating both simultaneously into a response threshold function like that of Deneubourg [8]. Instead, ACLUSTER uses combinations of two independent response threshold functions, each associated with a different environmental factor (or stimulus intensity), that is, the number of objects in the area and their similarity. The computation of average similarities is avoided in the ACLUSTER algorithm, since this strategy could be somewhat blind to the number of objects present in one specific neighbourhood. Bonabeau et al. [5] proposed a family of response threshold functions in order to model response thresholds. Every individual has a response threshold for every task; individuals engage in task performance when the level of the task-associated stimuli exceeds their thresholds. Technical details of ACLUSTER can be obtained from [16].

2.2 Experimental Setup and Clustering Results

In this research, we used the statistical/text data generated by the log file analyzer from 01 January 2002 to 07 July 2002. Selecting useful data is an important task in the data pre-processing block. After some preliminary analysis, we selected the statistical data comprising domain byte requests, hourly page requests and daily page requests as the focus of the cluster models for finding Web users' usage patterns. The most recently accessed data were indexed higher, while the least recently accessed data were placed at the bottom.

For each of the datasets (daily and hourly log data), the algorithm was run twice (for t = 1 to 1,000,000) in order to check whether the results were similar (which they appear to be, if we look at which data items are connected into which clusters). The classification space is always 2D, non-parametric and toroidal. Experimentation results for the daily and hourly Web traffic data are presented in Figures 4 and 5.
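The exact "type 1" probability function used by ACLUSTER is specified in [16] and is not reproduced here. The Python sketch below is only a simplified stand-in added for illustration: it shows the general picking/dropping mechanism described in Section 2.1, using Deneubourg-style response thresholds (k1 and k2 play the role of the threshold constants quoted in the figure captions; the similarity measure is an assumption of this sketch).

```python
import random

def pick_probability(local_density, k1=0.1):
    """Probability that an unladen ant picks up an item: high when few
    similar items are nearby (Deneubourg-style threshold response)."""
    return (k1 / (k1 + local_density)) ** 2

def drop_probability(local_density, k2=0.3):
    """Probability that a laden ant drops its item: high when many
    similar items are nearby."""
    return (local_density / (k2 + local_density)) ** 2

def local_density(item, neighbourhood, similarity):
    """Fraction of neighbouring cells holding items similar to `item`.
    `similarity` in [0, 1] is an assumed, application-specific measure."""
    if not neighbourhood:
        return 0.0
    return sum(similarity(item, other) for other in neighbourhood) / len(neighbourhood)

def ant_step(ant, neighbourhood, similarity):
    """One decision of a single ant (carrying or not carrying an item).
    `neighbourhood` holds the item on the ant's cell and the items around it."""
    if ant.get("carrying") is None:
        item = neighbourhood.get("item_here")
        if item is not None:
            f = local_density(item, neighbourhood["items_around"], similarity)
            if random.random() < pick_probability(f):
                ant["carrying"] = item
    else:
        f = local_density(ant["carrying"], neighbourhood["items_around"], similarity)
        if random.random() < drop_probability(f):
            ant["carrying"] = None  # item is deposited at the current cell

if __name__ == "__main__":
    sim = lambda a, b: 1.0 if a == b else 0.0
    ant = {"carrying": None}
    neighbourhood = {"item_here": "A", "items_around": ["A", "B", "A"]}
    ant_step(ant, neighbourhood, sim)
    print(ant)
```

Iterating such steps for many ants over a toroidal grid, with pheromone-biased movement instead of random walks, is what produces the progressive clustering shown in the snapshots below.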
Figure 4 (a-e). Snapshots of the spatial distribution of daily Web traffic data on a 25 x 25 non-parametric toroidal grid at time steps t = 1, 100, 500, 900 and 1,000,000. At t = 1, data items are randomly allocated on the grid. As time evolves, several homogeneous clusters emerge due to the ant colony action. The type 1 probability function was used with k1 = 0.1, k2 = 0.3 and 14 ants (k1 and k2 are threshold constants).

Figure 5 (a-e). Snapshots of the spatial distribution of hourly Web traffic data on a 45 x 45 non-parametric toroidal grid at time steps t = 1, 100, 500, 900 and 1,000,000. At t = 1, data items are randomly allocated on the grid. As time evolves, several homogeneous clusters emerge due to the ant colony action. The type 1 probability function was used with k1 = 0.1, k2 = 0.3 and 48 ants.

2.3 Linear Genetic Programming (LGP)

Linear genetic programming is a variant of the GP technique that acts on linear genomes [3]. Its main characteristic, in comparison to tree-based GP, is that the evolvable units are not the expressions of a functional programming language (like LISP) but the programs of an imperative language (like C/C++). An alternative approach is to evolve a computer program at the machine code level, using lower-level representations for the individuals. This can tremendously hasten the evolution process since, no matter how an individual is initially represented, it finally has to be represented as a piece of machine code, as fitness evaluation requires physical execution of the individuals.

The basic unit of evolution here is a native machine code instruction that runs on the floating-point processor unit (FPU). Since different instructions may have different sizes, instructions are clubbed together to form instruction blocks of 32 bits each. The instruction blocks hold one or more native machine code instructions, depending on the sizes of the instructions. A crossover point can occur only between instructions and is prohibited from occurring within an instruction. The mutation operation, however, does not have any such restriction.

2.4 Experimental Setup and Trend Analysis Results

Besides the inputs 'volume of requests', 'volume of pages (bytes)' and 'index number', we also used the 'cluster information' provided by the clustering algorithm as an additional input variable. The data were re-indexed based on the cluster information. Our task is to predict (a few time steps ahead) the Web traffic volume on an hourly and daily basis. We used the data from 17 February 2002 to 30 June 2002 for training, and the data from 01 July 2002 to 06 July 2002 for testing and validation purposes.

We used an LGP technique that manipulates and evolves a program at the machine code level, and used the Discipulus workbench for simulating LGP [2]. The settings of the various linear genetic programming system parameters are of utmost importance for successful performance of the system. The population space has been subdivided into multiple subpopulations, or demes. Migration of individuals among the subpopulations causes evolution of the entire population. It helps to maintain diversity in the population, as migration is restricted among the demes. Moreover, the tendency towards a bad local minimum in one deme can be countered by other demes with better search directions. The various LGP search parameters are the mutation frequency, the crossover frequency and the reproduction frequency. The crossover operator acts by exchanging sequences of instructions between two tournament winners.
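Discipulus performs this crossover directly on native machine code. The Python sketch below is an illustration added here, not the actual implementation: it mimics the operator on symbolic instruction blocks (the mnemonics are invented), exchanging a contiguous segment between two tournament winners while keeping cut points on block boundaries, as described above.

```python
import random

def lgp_crossover(parent_a, parent_b, rng=random):
    """Two-point crossover on linear genomes represented as lists of
    instruction blocks. Cut points fall only between blocks, mirroring
    the restriction that crossover cannot split an instruction."""
    a, b = list(parent_a), list(parent_b)
    i1, i2 = sorted(rng.sample(range(len(a) + 1), 2))
    j1, j2 = sorted(rng.sample(range(len(b) + 1), 2))
    child_a = a[:i1] + b[j1:j2] + a[i2:]
    child_b = b[:j1] + a[i1:i2] + b[j2:]
    return child_a, child_b

if __name__ == "__main__":
    # Toy parents made of symbolic instruction blocks (purely illustrative).
    p1 = ["fld x", "fadd r0", "fmul r1", "fsub r2"]
    p2 = ["fld y", "fdiv r3", "fadd r1"]
    c1, c2 = lgp_crossover(p1, p2)
    print(c1)
    print(c2)
```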
A steady-state genetic programming approach was used to manage the memory more effectively. After a trial-and-error approach, the following parameter settings were used for the experiments:

Population size: 500
Tournament size: 4
Maximum no. of tournaments: 120,000
Mutation frequency: 90%
Crossover frequency: 80%
Number of demes: 10
Maximum program size: 512
Target subset size: 100

The experiments were repeated three times and the test data were passed through the saved model. Figures 6 and 8 illustrate the average growth in program length for hourly and daily Web traffic. Figures 7 and 9 depict the training and test performance (average training and test fitness) for hourly and daily Web traffic. An empirical comparison of the proposed framework with some of our previous work is given in Tables 1 and 2. Performance comparisons of the proposed framework (ANT-LGP) with i-Miner (hybrid evolutionary fuzzy clustering - fuzzy inference system) [1], self-organizing map - linear genetic programming (SOM-LGP) [21] and self-organizing map - artificial neural network (SOM-ANN) [21] are graphically illustrated in Figures 10 and 11.

Figure 6. Hourly Web data analysis: growth in average program length during 120,000 tournaments.
Figure 7. Hourly Web data analysis: comparison of average training and test fitness during 120,000 tournaments.
Figure 8. Daily Web data analysis: growth in average program length during 120,000 tournaments.
Figure 9. Daily Web data analysis: comparison of average training and test fitness during 120,000 tournaments.

Table 1. Performance of the different paradigms for daily Web data (1 day ahead)

Hybrid method         RMSE (train)   RMSE (test)   CC
ANT-LGP               0.0191         0.0291        0.9963
i-Miner (FCM-FIS)     0.0044         0.0053        0.9967
SOM-ANN               0.0345         0.0481        0.9292
SOM-LGP               0.0543         0.0749        0.9315

Table 2. Performance of the different paradigms for hourly Web data (1 hour ahead)

Hybrid method         RMSE (train)   RMSE (test)   CC
ANT-LGP               0.2561         0.035         0.9921
i-Miner (FCM-FIS)     0.0012         0.0041        0.9981
SOM-ANN               0.0546         0.0639        0.9493
SOM-LGP               0.0654         0.0516        0.9446

3. Conclusions

The proposed ANT-LGP model seems to work very well for the problem considered. The empirical results also reveal the importance of using optimization techniques for mining useful information. In this paper, our focus was to develop accurate trend prediction models to analyze the hourly and daily Web traffic volume. Useful information could be discovered from the clustered data. The knowledge discovered from the developed clusters using different intelligent models could make a good comparison study and is left as a future research topic.

As illustrated in Tables 1 and 2, incorporation of the ant clustering algorithm helped to improve the performance of the LGP model (when compared to clustering using self-organizing maps). The i-Miner framework gave the overall best results, with the lowest RMSE on the test data and the highest correlation coefficient (CC).

Future research will also incorporate more data mining algorithms to improve knowledge discovery and association rules from the clustered data. The contribution of the individual input variables and of different clustering algorithms will also be investigated to improve the trend analysis and knowledge discovery.

Figure 10. Comparison of different paradigms for daily Web traffic trends.
Figure 11. Comparison of different paradigms for hourly Web traffic trends.

Bibliography

[1] Abraham A., i-Miner: A Web Usage Mining Framework Using Hierarchical Intelligent Systems, The IEEE International Conference on Fuzzy Systems, FUZZ-IEEE'03, pp. 1129-1134, 2003.
[2] AIM Learning Technology, <>
[3] Banzhaf W., Nordin P., Keller R. E., Francone F. D., Genetic Programming: An Introduction on the Automatic Evolution of Computer Programs and its Applications, Morgan Kaufmann Publishers, Inc., 1998.
[4] Bonabeau E., Dorigo M., Théraulaz G., Swarm Intelligence: From Natural to Artificial Systems, Santa Fe Institute Studies in the Sciences of Complexity, Oxford University Press, New York, Oxford, 1999.
[5] Bonabeau E., Théraulaz G., Deneubourg J.-L., Quantitative Study of the Fixed Response Threshold Model for the Regulation of Division of Labour in Insect Societies, Proc. Roy. Soc. B, 263, pp. 1565-1569, 1996.
[6] Chi E. H., Rosien A. and Heer J., LumberJack: Intelligent Discovery and Analysis of Web User Traffic Composition, In Proceedings of the ACM-SIGKDD Workshop on Web Mining for Usage Patterns and User Profiles, Canada, ACM Press, 2002.
[7] Cooley R., Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data, Ph.D. Thesis, Department of Computer Science, University of Minnesota, 2000.
[8] Deneubourg J.-L., Goss S., Franks N., Sendova-Franks A., Detrain C., Chrétien L., The Dynamics of Collective Sorting: Robot-like Ants and Ant-like Robots, SAB'90 - 1st Conf. on Simulation of Adaptive Behavior: From Animals to Animats, J. A. Meyer and S. W. Wilson (Eds.), pp. 356-365, MIT Press, 1991.
[9] Heer J. and Chi E. H., Identification of Web User Traffic Composition using Multi-Modal Clustering and Information Scent, In Proc. of the Workshop on Web Mining, SIAM Conference on Data Mining, pp. 51-58, 2001.
[10] Jespersen S. E., Thorhauge J. and Bach T., A Hybrid Approach to Web Usage Mining, Data Warehousing and Knowledge Discovery, LNCS 2454, Y. Kambayashi, W. Winiwarter, M. Arikawa (Eds.), pp. 73-82, 2002.
[11] Joshi K. P., Joshi A., Yesha Y., Krishnapuram R., Warehousing and Mining Web Logs, Proceedings of the 2nd ACM CIKM Workshop on Web Information and Data Management, pp. 63-68, 1999.
[12] Kosala R. and Blockeel H., Web Mining Research: A Survey, ACM SIGKDD Explorations, 2(1), pp. 1-15, 2000.
[13] Lumer E. D. and Faieta B., Diversity and Adaptation in Populations of Clustering Ants, In Cliff D., Husbands P., Meyer J. and Wilson S. (Eds.), From Animals to Animats 3, Proc. of the 3rd Int. Conf. on the Simulation of Adaptive Behavior, Cambridge, MA: The MIT Press/Bradford Books, 1994.
[14] Mobasher B., Cooley R. and Srivastava J., Creating Adaptive Web Sites through Usage-based Clustering of URLs, In Proceedings of the 1999 Workshop on Knowledge and Data Engineering Exchange, USA, pp. 19-25, 1999.
[15] Monash University Web site: <.au>
[16] Ramos V., Muge F., Pina P., Self-Organized Data and Image Retrieval as a Consequence of Inter-Dynamic Synergistic Relationships in Artificial Ant Colonies, Soft Computing Systems - Design, Management and Applications, 2nd Int. Conf. on Hybrid Intelligent Systems, IOS Press, pp. 500-509, 2002.
[17] Ramos V. and Merelo J. J., Self-Organized Stigmergic Document Maps: Environment as a Mechanism for Context Learning, in E. Alba, F. Herrera, J. J. Merelo et al. (Eds.), AEB'02 - 1st Int. Conf. on Metaheuristics, Evolutionary and Bio-Inspired Algorithms, pp. 284-293, Mérida, Spain, 2002.
[18] Ramos V. and Almeida F., Artificial Ant Colonies in Digital Image Habitats - A Mass Behaviour Effect Study on Pattern Recognition, in Marco Dorigo, Martin Middendorf and Thomas Stützle (Eds.), Proc. of ANTS'00 - 2nd Int. Workshop on Ant Algorithms, pp. 113-116, Brussels, Belgium, 2000.
[19] Smith K. A. and Ng A., Web page clustering using a self-organizing map of user navigation patterns, Decision Support Systems, Volume 35, Issue 2, pp. 245-256, 2003.
[20] Srivastava J., Cooley R., Deshpande M. and Tan P. N., Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, SIGKDD Explorations, vol. 1, no. 2, pp. 12-23, 2000.
[21] Wang X., Abraham A. and Smith K. A., Soft Computing Paradigms for Web Access Pattern Analysis, Proceedings of the 1st International Conference on Fuzzy Systems and Knowledge Discovery, pp. 631-635, 2002.

Frequent Pattern Mining in Web Log Data

Authors: Shen Ming, Deng Yufen, Zhang Bo (Navy Oceanic Mapping and Survey Institute, Tianjin 300061, China). Source: Modern Electronics Technique (《现代电子技术》), 2010, No. 9. CLC number: TP29; document code: A; article ID: 1004-373X(2010)09-0180-04.

Abstract: Frequent pattern mining is widely applied and is an important research field in data mining; one of its application areas is data mining based on Web log data. The aim of discovering frequent patterns in Web logs is to obtain the navigational behaviour patterns of users; this information can provide references for advertising purposes and for creating dynamic user profiles. Three frequent pattern mining approaches are investigated from the Web data mining perspective; the corresponding patterns in Web log mining are page sets, page sequences and page graphs.

Keywords: pattern mining; sequence mining; graph mining; Web log mining

0 Introduction

The World Wide Web provides a huge amount of data useful to users. Different types of data should be organized into forms that can be used effectively by different users; therefore, Web-based data mining techniques have attracted more and more researchers.
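As a small illustration of the first of these three pattern types, the Python sketch below (added for illustration; it is not the article's algorithm, and the session and page names are invented) counts frequent page sets of size one and two over user sessions extracted from a Web log, using the usual minimum-support threshold.

```python
from collections import Counter
from itertools import combinations

def frequent_page_sets(sessions, min_support=0.3):
    """Return 1- and 2-item page sets whose support (fraction of sessions
    containing them) is at least `min_support`."""
    n = len(sessions)
    singles, pairs = Counter(), Counter()
    for session in sessions:
        pages = set(session)                      # ignore repeat visits within a session
        singles.update(pages)
        pairs.update(combinations(sorted(pages), 2))
    frequent = {}
    for itemset, count in list(singles.items()) + list(pairs.items()):
        support = count / n
        if support >= min_support:
            key = itemset if isinstance(itemset, tuple) else (itemset,)
            frequent[key] = support
    return frequent

if __name__ == "__main__":
    sessions = [
        ["/index", "/products", "/training"],
        ["/index", "/products"],
        ["/index", "/support"],
        ["/products", "/training"],
    ]
    for itemset, support in sorted(frequent_page_sets(sessions).items()):
        print(itemset, round(support, 2))
```

Page sequences add the visiting order to these sets, and page graphs additionally keep the link structure followed by the user, which is why the three pattern types carry increasing amounts of navigational information.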

Data Warehouse and Data Mining Technology (《数据仓库与数据挖掘技术》), Chapter 7: Unstructured Data Mining

Comparison of Web mining categories:

Web structure mining - methods: machine learning and dedicated algorithms (e.g., HITS, PageRank); applications: page-weight classification and clustering, pattern discovery.

Web usage mining - data: server logs, proxy server logs, client logs; representation: relational tables, graphs; methods: statistics, machine learning, association rules; applications: user personalization, adaptive Web sites, business decision support.

7.1 Web Data Mining

Figure: basic architecture of Web mining. Visitors and registered users interact with the Web site; transaction information, browsing information, databases and data warehouses, Web log files and other Web-server information feed a data pre-processing module, whose output is passed to a structured data mining module and an unstructured data mining module; these produce page-access statistics, Web structure patterns and Web content patterns, and ultimately knowledge.

7.1.3 Web Content Mining

Two families of approaches are distinguished: information retrieval (IR) methods and database methods.

Figure 7.5: function-driven multimedia mining architecture (the user interacts with the system through interactive learning and information integration).
Figure 7.6: information-driven multimedia mining architecture, organized in layers: pattern/knowledge-level processing (pattern knowledge, domain knowledge, multimedia processing), semantic-concept-level processing (semantic-level retrieval and indexing), object-level feature processing (object-level indexing and retrieval), and physical low-level feature processing (indexing and retrieval based on low-level features; analysis of metadata and data-extraction principles).

Web content mining, IR view - data: unstructured and semi-structured data (free text, HTML-tagged hypertext); representation: word sets, paragraphs, concepts, the three classic IR models; methods: TF-IDF, statistics, machine learning, natural language understanding; applications: classification, clustering, pattern discovery.

Web content mining, database view - data: semi-structured data (HTML-tagged hypertext); representation: OEM, relational models; methods: database techniques; applications: pattern discovery, data guides, multi-dimensional databases, site creation and maintenance.


Research Issues in Web Structural Delta Mining

Qiankun Zhao (1), Sourav S. Bhowmick (1), and Sanjay Madria (2)
(1) School of Computer Engineering, Nanyang Technological University, Singapore. qkzhao@.sg, assourav@.sg
(2) Department of Computer Science, University of Missouri-Rolla, USA. madrias@

Summary. Web structure mining has been a well-researched area in recent years. Based on the observation that data on the web may change at any time and in any way, some incremental data mining algorithms have been proposed to update the mining results with the corresponding changes. However, none of the existing web structure mining techniques is able to extract useful and hidden knowledge from the sequence of historical web structural changes. While the knowledge obtained from a snapshot is important and interesting, the knowledge behind the corresponding changes may be more critical and informative in some applications. In this paper, we propose a novel research area of web structure mining called web structural delta mining. The distinct feature of our research is that the mining object is the sequence of historical changes of web structure (also called web structural deltas). For web structural delta mining, we aim to extract useful, interesting, and novel web structures and knowledge, considering their historical, dynamic, and temporal properties. We propose three major issues of web structural delta mining: identifying useful and interesting structures, discovering associations from structural deltas, and building structural-change-pattern-based classifiers. Moreover, we present a list of potential applications where the web structural delta mining results can be used.

1 Introduction

With the progress of World Wide Web (WWW) technologies, more and more data are now available online for web users. It can be observed that web data cover a wide spectrum of fields, from governmental, entertainment and commercial data to research data. At the same time, more and more data stored in traditional repositories are migrating to the web. According to most predictions made in 1999, the majority of human data will be available on the web within ten years [9]. The availability of a huge amount of web data does not imply that users can get whatever they want more easily. On the contrary, the massive amount of data on the web has already overwhelmed our ability to find the desired information. It has been observed that 99% of the data reachable on the web is useless to 99% of the users [10]. However, the huge and diverse properties of web data do imply that there should be useful knowledge hidden behind web data, which cannot be easily interpreted by human intuition.

Fig. 1. Two versions of a web site structure, panels (a) and (b); the nodes shown include Linux pro, RH 7.2, RH 8.0 and RH 9.0.

Under such circumstances, web mining was initially introduced to automatically discover and extract useful but hidden knowledge from web data and services [8]. Web mining was defined as a converging research area drawing from several research communities, such as databases, information retrieval, machine learning, and natural language processing [13]. The objects of web mining are web resources, which can be web documents, web log files, the structure of web sites, and the structure of web documents themselves. Recently, many research efforts have been directed to web mining, and web mining is now widely used in different areas [13]. Search engines, such as Google, use web mining techniques to rank query results according to the importance and relevance of the pages [18].
Web mining is also expected to create structural summarizations for web pages [17], which can make search engines work more efficiently as well. Moreover, web mining is used to classify web pages [12], identify web communities [4], etc.

One of the key features of web data is that it may change at any time and in any way. New data are inserted into the web; obsolete data are deleted, while other data are modified. Corresponding to the types of web data, changes can be classified into three categories: changes of web content, changes of web structure, and changes of web usage. Due to the autonomous nature of the web, these changes may occur without notifying the users. We believe that the dynamic and autonomous properties of web data pose both challenges and opportunities to the web mining community.

Let us elaborate on this further. The knowledge and information mined from obsolete data may no longer be valid and useful once the web data change. Let us take one of the classic web structure mining algorithms, HITS [12], as an example. With the evolution of pages on the web, more and more web pages are created and linked to some of the existing web pages; some outdated web documents are deleted along with the corresponding links, while other hyperlinks may be updated due to changes of web content. Consequently, the set of authoritative and hub pages [12] computed at time t1 may change at time t2. That is, some of the previously authoritative pages may not be authoritative any more, and similar cases may happen to hub pages. Thus, the mining results of the HITS algorithm may no longer be accurate and valid after the web data change.

With the dynamic nature of web data, there is an opportunity to obtain novel, more useful and informative knowledge from the historical changes, which cannot be discovered using traditional web mining techniques on snapshot data. For example, suppose we have two versions of a web site structure as shown in Figure 1. In this figure, we use grey boxes to represent pages deleted from the previous version, black boxes to denote newly inserted pages, and bold boxes to represent updated pages. From the changes of the web structure in the two versions, it can be observed that when the information about products changes, for example when new products are added or outdated products are deleted, the information on training changes accordingly. For instance, in Figure 1(b), when a new product RH 9.0 is added, a new training course corresponding to this product, RH402, is inserted. In this example, it can be inferred that the information under Product and Service and under Training may be associated. Such an inference can be verified by examining the historical web structural deltas.
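To make the notion of a structural delta concrete, the following Python sketch (an illustration added here, not part of the original paper) compares two versions of a site structure, each represented simply as a set of parent-child links, and reports the inserted and deleted links; applied to every consecutive pair of historical versions, such diffs form the delta sequence that the paper proposes to mine.

```python
def structural_delta(old_links, new_links):
    """Compare two versions of a site structure.
    Each version is a set of (parent, child) page links."""
    return {
        "inserted": new_links - old_links,
        "deleted": old_links - new_links,
    }

if __name__ == "__main__":
    # Toy versions loosely modelled on Figure 1; page names are illustrative.
    v1 = {("home", "products"), ("products", "rh8.0"),
          ("home", "training"), ("training", "rh300")}
    v2 = {("home", "products"), ("products", "rh9.0"),
          ("home", "training"), ("training", "rh402")}
    delta = structural_delta(v1, v2)
    print("inserted:", sorted(delta["inserted"]))
    print("deleted:", sorted(delta["deleted"]))
```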
Knowledge about associations of structural changes can then be extracted by applying association rule mining techniques to the historical changes of web structural data. The extracted knowledge can be rules such as "changes of substructure A imply changes of substructure B within a certain time window, with certain support and confidence". Besides association rules, interesting substructures and enhanced classifiers can also be extracted from the historical web structural changes. From the above example, we argue that knowledge extracted from historical structural deltas is more informative than the mining results from snapshot data.

Based on the above observations, in this paper we propose a novel approach to extract hidden knowledge from the changes of the historical web structural data (also known as web structural deltas). Firstly, this approach is expected to be more efficient, since the mining object is the sequence of deltas, which is generally much smaller than the original sequence of structural data in terms of size. Secondly, novel knowledge that could not be extracted before can be discovered by incorporating the dynamic property of web data and temporal attributes. The intuition behind this is that while the knowledge from a snapshot is important and interesting, the knowledge behind the corresponding changes may be more critical and informative. Such knowledge can be extracted by using different data mining techniques such as association rule mining, sequential pattern mining, classification, and clustering [10]. In this paper, we focus on exploring research issues for mining knowledge from the historical changes of web structural data.

The organization of this paper is as follows. In Section 2, we present a list of related works; it includes web structure mining techniques and change detection systems for web data. The formal definition of web structural delta mining is presented in Section 3, where the different research issues are also discussed. The list of applications where the web structural delta mining results can be used is presented in Section 4. Finally, the last section concludes the paper.

2 Related Work

Our proposed web structural delta mining research is largely influenced by two research communities. The web mining community has looked at developing novel algorithms to mine snapshots of web data. The database community has focused on detecting, representing, and querying changes to web data. We review some of these technologies here.

2.1 Web Structure Mining

Over the last few years, web structure mining has attracted a great deal of attention in the web mining community. Web structure mining was initially inspired by the study of social networks and citation analysis [13]. Web structure mining was defined to generate structural summaries about web sites and web pages. It includes the study of the hyperlinked structure of the web [18], categorizing web pages into authoritative pages and hub pages [12], and generating community information with respect to the similarity and relations between different web pages and web sites [4]. We give a brief review of these techniques, covering two classic web structure mining algorithms and some algorithms for identifying web communities.

PageRank Algorithm:

One of the algorithms that analyze the hyperlink structure of the web is PageRank [18]. PageRank is an algorithm developed at Stanford University and now employed by the web search engine Google.
PageRank is used to rank the search results of the search engine according to the corresponding PageRank values, which are calculated based on the structure information. In essence, PageRank interprets a link from page A to page B as a vote, by page A, for page B. However, PageRank looks at more than the sheer volume of votes, or links, a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important". Important and high-quality web sites receive high ranks in the search results.

HITS Algorithm:

Most research efforts for classifying web pages try to categorize web pages into two classes: authoritative pages and hub pages. The idea of authoritative and hub pages was introduced by Kleinberg in 1998 [12]. An authoritative page is a web page that is linked to by most web pages that belong to a specific topic, while a hub page is a web page that links to a group of authoritative web pages. One of the basic algorithms to detect authoritative and hub pages is HITS [12]. The main idea is to classify web pages into authoritative and hub pages based on the in-degree and out-degree of the corresponding pages after they have been mapped into a graph structure. This approach is purely based on hyperlinks. Moreover, the algorithm focuses only on web pages belonging to a specific topic.

Web Community Algorithms:

The cyber community [4] is one of the applications based on the analysis of similarity and relationships between web sites or web pages. The main idea of web community algorithms is to construct a community of web pages or web sites that share a common interest. Web community algorithms are clustering algorithms: the goal is to maximize the similarity within individual web communities and minimize the similarity between different communities. The measures of similarity and relationship between web sites or web pages are based not only on hyperlinks but also on some content information within the web pages or web sites.

2.2 Change Detection for Web Data

Considering the dynamic and autonomous properties of web data, many efforts have recently been directed to research on change detection for web data [7, 16, 15, 6, 19]. According to the format of web documents, web data change detection techniques can be classified into two categories: one for HTML documents, which are dominant among current web data, and another for XML documents, which are expected to become dominant in the near future. We briefly review some of the web data change detection techniques now.

Change Detection for HTML Documents:

Currently, most of the existing web documents are in HTML format, which is designed for display purposes. An HTML document consists of markup tags and content data, where the markup tags are used to manipulate the representation of the content data. The changes of HTML documents can be changes of the HTML markup tags or of the content data. The changes can be at the sub-page level or the page level.
The AT&T Internet Difference Engine(AIDE)[7]was proposed tofind and display changes to web pages.It can detect changes of insertion and deletion.WebCQ[16] is a system for monitoring and delivering web information.It provides personalized services for notifying and displaying changes and summarizations of corresponding interested web pages.SCD algorithm[15]is a sub page level change detection al-gorithm,which detects semantic changes of hierarchical structured data contents in any two HTML documents.6Qiankun Zhao,Sourav S.Bhowmick,and Sanjay MadriaChange Detection for XML Document:Recently,XML documents are becoming more and more popular to store and ex-change data in the web.Different techniques of detecting changes for XML docu-ments have been proposed[6,19].For instance,XyDiff[6]is used to detect changes of ordered XML documents.It supports three types of changes:insertion,deletion, and updating.X-Diff[19]is used to detect changes of unordered XML documents. It takes the XML documents as unordered tree,which makes the change detection process more difficult.In this case,two trees are equivalent if they are isomorphic, which means that they are identical except for the orders among siblings.The X-Diff algorithm can also identify three types of changes:insertion,deletion,and updating. 3Web Structure Delta MiningBased on recent research work in web structure mining,besides the validity of min-ing results and the hidden knowledge behind historical changes that we mentioned earlier,we observed that there are two other important issues have not been addressed by the web mining research community.•Thefirst observation is that existing web structure mining algorithms focus only on the in degree and out degree of web pages[12].They do not consider the global structural property of web documents.Global properties such as the hier-archy structure,location of the web page among the whole web site and relations among ancestor and descendant pages are not considered.However,such infor-mation is important to understand the overall structure of a web site.For example, each part of the web structure is corresponding to a underlining concept and its instance.Consequently,the hierarchy structure represents the relations among the concepts.Moreover,based on the location of each concept in the hierarchy structure,the focus of a web site can be extracted with the assumption that the focus of the web site should be easy to be accessed.•The second observation is that most web structure mining techniques ignored the structural information within individual web documents.Even if there are some algorithms designed to extract the structure of web documents[2],these data mining techniques are used to extract intra-structural information but not to mine knowledge behind the intra-structural information.With the increasing popular-ity of XML documents,this issue is becoming more and more important because XML documents carry more structural information compared to its HTML coun-terpart.Web structural delta mining is to address the above limitations.As shown in Fig-ure2,it bridges the two popular research areas,web structure mining and change detection of web data.In this section,we willfirst give a formal definition of web structural delta mining from our point of view.Next,we elaborate on the major re-search issues of web structural delta mining.Research Issues in Web Structural Delta Mining7Fig.2.Web Structural Delta Mining3.1Problem StatementThe goal of web structural delta mining is to extract any kind of 
The goal of web structural delta mining is to extract any kind of interesting and useful information from the historical web structural changes. As the object of web structural delta mining can be the structure of a web site, the structure of a group of linked web pages, or even the structure within an individual web page, we introduce the term web object to cover such objects.

Definition 1. Let O = {w1, w2, ..., wn} be a set of web pages. O is a web object if it satisfies either of the following constraints: 1) n = 1; 2) for any 1 ≤ i ≤ n, wi links to or is linked by at least one of the pages from {w1, ..., w(i-1), w(i+1), ..., wn}.

Fig. 3. Architecture of web structural delta mining.

From the definition, we can see that a web object can be either an individual web page or a group of linked web pages. Thus, the structure of a web object O refers to the intra-structure within the web page if O includes only one web page; otherwise it refers to the inter-structure among the web pages in this web object. The web object is defined in such a way that each web object corresponds to an instance of a semantic concept. With respect to the dynamic property of web data, we observe that web pages or links may be inserted into or deleted from a web object, and an individual web page may itself also change over time. Consequently, the structure of a web object may also change. Web structural delta mining analyzes the historical structural changes of web objects. From our point of view, web structural delta mining is defined as follows.

Definition 2. Let S1, S2, ..., Sn be a sequence of historical web structural information about a web object, where Si is the i-th version of the structural information about the web object O at time ti, and S1, S2, ..., Sn are in time order. Assume that this series records all versions of the structural information for a period of time. The objective of web structural delta mining is to extract structures with certain change patterns, discover associations among structures in terms of their change patterns, and classify structures based on their historical change patterns, using various data mining techniques.

Our definition of web structural delta mining is different from existing web structure mining definitions. In our definition, we incorporate the temporal property (by taking different versions of web structures as a sequence), the dynamic property (by detecting the changes between different versions of web structures), and the hierarchical property of web structural data (by taking into account the hierarchy structures of web sites), as shown in Figure 3. The basic idea is as follows. First, given a sequence of historical web structural data, the sequence of historical web structural deltas can be extracted by using modified versions of existing web data change detection systems. On the other hand, based on the dynamic metric, the global metric and the temporal metric, different types of interesting structures can be defined based on their historical change patterns. Based on these definitions, the desired structures can be extracted from the sequence of historical web structural deltas by using data mining techniques.
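As a concrete, much simplified illustration of this pipeline (added here for exposition; the actual node dynamic and version dynamic metrics are defined in [22] and discussed in Section 3.2), the Python sketch below takes a sequence of structure versions, derives the delta sequence, and records for each parent node the fraction of version transitions in which its set of children changed. The per-node dictionary representation is an assumption of the sketch.

```python
def delta_sequence(versions):
    """versions: list of dicts mapping a parent node to the set of its children.
    Returns one delta per consecutive pair of versions: the parents whose
    child sets differ between the two versions."""
    deltas = []
    for old, new in zip(versions, versions[1:]):
        changed = {p for p in set(old) | set(new)
                   if old.get(p, set()) != new.get(p, set())}
        deltas.append(changed)
    return deltas

def change_frequency(versions):
    """Crude stand-in for the 'version dynamic' idea: for each parent node,
    the fraction of version transitions in which it changed."""
    deltas = delta_sequence(versions)
    transitions = len(deltas) or 1
    nodes = set().union(*(set(v) for v in versions))
    return {p: sum(p in d for d in deltas) / transitions for p in nodes}

if __name__ == "__main__":
    versions = [
        {"home": {"products", "training"}, "products": {"rh8.0"}},
        {"home": {"products", "training"}, "products": {"rh8.0", "rh9.0"}},
        {"home": {"products", "training"}, "products": {"rh9.0"}},
    ]
    print(change_frequency(versions))
```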
Besides interesting substructures,other knowledge such as association among struc-tural deltas,structural change pattern based classifiers can also be discovered.Ac-cording to the definition,web structural delta mining includes three issues:identify interesting and useful substructures,extract associations among structural deltas, and structural change pattern based classifier.We now discuss the details of the three issues in turn.3.2Identify Interesting and Useful SubstructuresWe now introduce different types of interesting and useful substructures that may be discovered using web structural delta mining.In Section4,we will elaborate on the applications of these structures.The interesting substructures include frequently changing structure,frozen structure,surprising structure,imbalanced structure,pe-riodic dynamic structure,increasing dynamic structure,and decreasing dynamic structure.Due to the lack of space,we will elaborate only on thefirst four of them. Frequently Changing Structure:Given a sequence of historical web structural data about a web object,we may ob-serve that different substructures change at different frequency with different signif-icance.Here frequently changing structure refers to substructures that change moreResearch Issues in Web Structural Delta Mining9 frequently and significantly compared with other substructures[22].Let us take the inter-structure of the in Figure1as an example.In thisfigure,we observed that some of the substructures have changed between the two versions while others did not.The substructures rooted at nodes of Product and Service and Train-ing changed more frequently compared to other substructures.In order to identify the frequently changing structure,we proposed two dynamic metrics node dynamic and version dynamic in[22].The node dynamic measures the significance of the struc-tural changes.The version dynamic measures the frequency of the structural changes against the history.Based on the dynamic metrics,the frequently changing structure can be defined as structures whose version dynamic and node dynamic are no less than the predefined thresholds.Two algorithms for discovering frequently changing structures from historical web structural delta have been proposed in[22].Frozen Structure:Frozen structures are the inverse to the frequently changing structures,as these struc-tures seldom change or never change.To identify such kind of structures,we intro-duce another type of useful structure named frozen structure.Frozen structure,from the words themselves,refers to those structures that are relatively stable and seldom change in the history.Similarly,based on the dynamic metric,frozen structure can be defined as structures whose values of node dynamic and version dynamic do not exceed certain predefined thresholds.Surprising Structure:Based on the historical dynamic property of certain structures,the corresponding evolutionary patterns can be extracted.However,for some structures,the changes may not always be consistent with the historical patterns.In this case,a metric can be proposed to measure the surprisingness of the changes.If for certain number of changes,the surprisingness exceeds certain threshold,then,the structure is defined as a surprising structure.Any structure whose change behavior deviate from the knowledge we learned from the history is a surprising structure.For example,a frozen structure that suddenly changed very significantly and frequently may be a surprising structure;a frequently changing structure suddenly stopped to 
Imbalanced Structure:
Besides the dynamic properties of web structures, there is another property, the global property, which has not been considered by the web structure mining community. If we represent the World Wide Web as a tree structure, the global property of the structure refers to the depth and cardinality of a node, which was introduced in the imbalance structure research initiated in the context of hypertext structure analysis by Botafogo et al. [3]. In their work, two imbalance metrics, depth imbalance and child imbalance, were used to measure the global property of a structure.

Fig. 4. Two versions of a web site structure: (a) version 1; (b) version n.

The imbalance metrics are based on the assumption that each node in the structure carries the same amount of information and that the links from a node are further developments of the node. Under this assumption, an ideal information structure is a balanced tree. However, a good structure need not be balanced, since some structures are deliberately designed to be imbalanced.

In our study, we assume that the first version of any structure is well designed even if it contains some imbalanced structures. Our concern is that, as web data is autonomous, some balanced structures may become imbalanced due to changes to the web data. We argue that such changes are sometimes undesirable. For example, suppose Figure 4(a) is the first version of a web site structure, and Figure 4(b) depicts the modified web site structure after a sequence of change operations over a period of time. From the two versions, we can observe the following changes. The depth of node 8 changed from 2 to 5; the number of descendant nodes of node 3 increased dramatically, while the numbers of descendant nodes of its two siblings did not change. One consequence of these changes is that the cost of accessing node 8 from the root node in Figure 4(b) is higher than the cost in Figure 4(a). The depth of a node reflects the cost of accessing that node from the root node, and nodes that are very deep inside the tree are unlikely to be visited by the majority of users. Another consequence is that node 3 carries more information in Figure 4(b) than in Figure 4(a); the number of descendant nodes indicates the importance of the node. With such information, the values of the imbalance metrics can be calculated and the imbalanced structures can be identified. Based on the imbalanced structures, web designers can check whether such imbalance is desirable or not. Consequently, a web structure of higher quality can be maintained.

Besides identifying the imbalanced structures, we want to further analyze the changes to find out what kinds of changes have the potential to cause imbalance in a structure. With such knowledge, undesirable imbalanced structures can be avoided by taking appropriate actions in advance.
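Along the same lines, a rough sketch of how imbalance could be measured and tracked across versions is given below, reusing the Node and find helpers from the previous sketch. Botafogo et al.'s actual depth imbalance and child imbalance formulas [3] are not reproduced here; the variance-based proxies below, as well as the flag_imbalanced reporting function and its threshold, are simplified assumptions for illustration only.

from statistics import pvariance

def subtree_size(tree):
    """Number of nodes in the subtree (the node's cardinality)."""
    return 1 + sum(subtree_size(c) for c in tree.children)

def leaf_depths(tree, depth=0):
    """Depths of all leaves below `tree`."""
    if not tree.children:
        return [depth]
    out = []
    for c in tree.children:
        out.extend(leaf_depths(c, depth + 1))
    return out

def depth_imbalance(tree):
    """Illustrative proxy: variance of leaf depths (0 for a perfectly balanced tree)."""
    return pvariance(leaf_depths(tree))

def child_imbalance(tree):
    """Illustrative proxy: variance of the children's subtree sizes."""
    if len(tree.children) < 2:
        return 0.0
    return pvariance([subtree_size(c) for c in tree.children])

def flag_imbalanced(old_root, new_root, threshold=4.0):
    """Report nodes whose child imbalance grew past `threshold` between versions."""
    report = []
    def walk(node):
        counterpart = find(old_root, node.label)   # find() from the previous sketch
        before = child_imbalance(counterpart) if counterpart else 0.0
        after = child_imbalance(node)
        if after - before > threshold:
            report.append((node.label, before, after))
        for c in node.children:
            walk(c)
    walk(new_root)
    return report

Applied to the two versions of Figure 4, such a report would single out node 3, whose children's subtree sizes diverge sharply in version n while remaining balanced in version 1.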
3.3 Extract Associations among Structural Deltas

Another important research issue of web structural delta mining is to discover associations among structural deltas [5]. The basic idea is to extract correlations among the occurrences of different changes to different web structures. It is similar to traditional association rule mining in some sense if we treat each possible substructure as an item in a transaction database. However, our structural association rule mining is more complex for the following reasons. Firstly, not every combination of two length-(i-1) items in the database can be a candidate length-i item, since the combination is meaningless if the items cannot form a complete structure. Therefore, a more complex candidate item generation method is needed compared to Apriori [1]. Secondly, for each item in the database the type of change can be different, such as insertion, deletion, and update, while traditional transaction data has no such attribute. This makes the mining process more complex [1]. There are two types of association rules among structural deltas.

Structural Delta Association Rule Mining:
Considering the changes of web structural data, it can be observed that some substructures change simultaneously more often than others. The goal of structural delta association rule mining is to discover the sets of substructures that change together with certain support and confidence [5]. An example of a structural delta association rule is S1 → S2, where S1 and S2 are two different substructures. For instance, based on a sequence of historical versions of the structure in Figure 1, we may discover that the substructures rooted at the nodes Product and Service and Training are strongly associated with respect to their changes: whenever the structure rooted at Product and Service changes, the structure rooted at Training also changes with certain confidence in the history. However, besides the presence of the substructures, the types of changes and the significance of the changes can differ between substructures. Consequently, more informative and useful structural delta association rules can be extracted by using multi-dimensional association rule mining techniques.

Semantic Delta Association Rule Mining:
Besides structural delta association rule mining, in which only structural information is considered, semantic association rules can also be extracted from historical structural deltas if we incorporate metadata, such as summaries or keywords of the web pages in the web site, into the association rules. In this case, we want to extract knowledge such as which semantic objects are associated in terms of their change histories. An example of the semantic delta association rule can be that more

如何辨别网络信息英语作文How to Discriminate Online Information.In the era of information technology, the internet has become a ubiquitous presence in our daily lives, providing us with a vast array of knowledge and resources. However, with the proliferation of online content, it has become increasingly challenging to discern the authenticity and credibility of the information we encounter. This essayaims to delve into the intricacies of discerning online information, discussing the importance of critical thinking, fact-checking, and evaluating the credibility of sources.Critical thinking is fundamental in distinguishing reliable information from misinformation. It involves actively analyzing, evaluating, and interpreting thecontent we encounter online. When confronted with a claimor statement, it is crucial to ask questions such as: Whois the author? What are their qualifications and biases? What evidence supports the claim? Are there any alternativeexplanations or perspectives? By engaging in critical thinking, we can avoid being swayed by unverified assertions or uncritical acceptance of information.Fact-checking is another crucial step in verifying the accuracy of online content. In the age of the internet, false information can spread rapidly, often reaching a wide audience before being fact-checked or corrected. Therefore, it is essential to verify the facts behind the information we encounter. One effective method is to cross-check the information against multiple reliable sources. If the same information is consistently reported by multiple credible news outlets or authorities, it is more likely to be accurate. Additionally, utilizing tools like search engines and fact-checking websites can help us verify the authenticity of claims.Evaluating the credibility of sources is also paramount in discerning online information. Credible sources are typically reliable, trustworthy, and have a reputation for accuracy and objectivity. When assessing the credibility of a source, we should consider factors such as the source'sreputation, the qualifications of its writers or reporters, and its commitment to accuracy and transparency. Credible news outlets, for instance, often have a rigorous vetting process for their content, ensuring that the information they publish is accurate and reliable. By contrast, sources that lack transparency, frequently publish unverified information, or have a history of publishing falsehoods should be treated with caution.Moreover, it is essential to recognize the existence of online biases and agendas. The internet is a diverse platform where various individuals and organizations have their own perspectives and agendas. Some may seek to manipulate or influence public opinion through the dissemination of misinformation or propaganda. Therefore,it is crucial to be vigilant and aware of these biases when evaluating online information. We should strive to consume information from a diverse range of sources, including those with different perspectives, to ensure a balanced understanding of issues.In conclusion, discerning online information requires acombination of critical thinking, fact-checking, and evaluating the credibility of sources. As we navigate the vast expanse of the internet, it is crucial to approach information with skepticism and a mindset of skepticism. By cultivating these skills and applying them to our online experiences, we can better discern reliable information from misinformation, ensuring that we are informed andwell-versed in the knowledge we consume.。

正确辨别网络信息英语作文Title: Correctly Identifying Online Information。

In today's digital age, the ability to discern the reliability and accuracy of online information is more crucial than ever. With the vast amount of contentavailable on the internet, ranging from credible sources to misinformation and fake news, it has become increasingly challenging to separate fact from fiction. In this essay, we will explore strategies for correctly identifying online information.First and foremost, it is essential to verify the credibility of the source. Reliable sources typically include established news organizations, reputable research institutions, and government websites. When evaluating a source, consider factors such as the author's expertise, the publication's reputation, and whether the information is supported by evidence or citations.Additionally, cross-referencing information from multiple sources can help confirm its accuracy. If a particular claim or story is only reported by one source,it is advisable to seek corroboration from other reputable sources before accepting it as true. Furthermore, comparing information across different perspectives can provide a more comprehensive understanding of a topic and help identify biases or inaccuracies.Another important aspect of discerning online information is evaluating the credibility of individual pieces of content, such as articles, videos, or social media posts. Pay attention to the language used, the presence of grammatical errors or sensationalist headlines, and the use of anonymous or unverifiable sources. Be cautious of content that elicits strong emotional responses or seems designed to provoke controversy without providing substantive evidence.Moreover, consider the context in which the information is presented. Misinformation often spreads rapidly during times of crisis or uncertainty, taking advantage ofpeople's fears and anxieties. Take a moment to pause and critically evaluate information before sharing it with others, especially on social media platforms where misinformation can proliferate quickly.Furthermore, be mindful of the role that algorithms and filter bubbles play in shaping the information we encounter online. Social media platforms and search engines use algorithms to personalize content based on our browsing history and preferences, potentially creating echo chambers where we are only exposed to information that aligns with our existing beliefs. To counteract this effect, actively seek out diverse perspectives and engage with a variety of sources.In conclusion, correctly identifying online information requires a combination of critical thinking skills, skepticism, and digital literacy. By verifying the credibility of sources, cross-referencing information, evaluating individual pieces of content, considering the context, and being mindful of algorithmic biases, we can become more discerning consumers of online information. Inan age where misinformation and fake news abound, these skills are more important than ever in order to navigate the digital landscape responsibly.。

提高网络辨别能力英语作文English.1. How can I improve my ability to evaluate the credibility of information on the internet?Critical Thinking Skills:Question sources: Examine the author's credentials, affiliation, and biases.Check facts: Verify claims through reliable sources and fact-checking websites.Evaluate evidence: Assess the quality, relevance, and sufficiency of supporting evidence.Consider perspectives: Seek out diverse viewpoints and consider alternative interpretations.Apply logical reasoning: Identify logical fallacies and biases that may distort the information.Digital Literacy Skills:Use search engines effectively: Employ advanced search techniques to filter results and find trustworthy sources.Evaluate website design: Pay attention to the site's layout, content quality, and security measures.Identify sponsored content: Be aware of advertisements and paid promotions disguised as legitimate news articles.Use social media responsibly: Critically evaluate information shared on social media platforms, considering the biases and motivations of the posters.Engage with trusted sources: Follow reputable news organizations, fact-checking websites, and subject matter experts on social media.2. What tools and resources can I use to help me evaluate the credibility of information on the internet?Fact-checking websites: Snopes, PolitiFact, .Media literacy organizations: Media Matters for America, First Draft News.Search engine features: Google Fact Check Tools, Wikipedia Knowledge Graph.Browser extensions: NewsGuard, Web of Trust.Social media verification tools: Twitter "blue checkmarks," Facebook debunking features.3. How can I teach others about evaluating the credibility of information on the internet?Encourage critical thinking: Foster skeptical thinking and analytical skills.Provide resources: Share reliable fact-checking and media literacy websites.Hold discussions: Engage in discussions about online information sources and highlight warning signs of bias and credibility issues.Use role-playing: Conduct simulations where students evaluate the credibility of different information sources.Create awareness campaigns: Develop educational materials and raise awareness about the importance of digital literacy.中文回答。

煤矿助理工程师职称评定条件及流程Being granted the title of Assistant Mine Engineer is a significant milestone and achievement for individuals in the mining industry. The criteria for evaluating and determining who is eligible for this title are rigorous and demanding. To begin with, applicants must possess a strong educational background in mining engineering or related fields. They should have obtained a relevant degree from an accredited university or college and demonstrated a deep understanding of mining principles and practices through their coursework.获得煤矿助理工程师职称对于矿业领域的个人来说是一个重要的里程碑和成就。

评定谁有资格获得这一头衔的标准是严格而苛刻的。

首先,申请人必须拥有扎实的采矿工程或相关领域的教育背景。

他们应该从一所认可的大学或学院获得相关学位,并通过课程表现出对采矿原则和实践的深刻理解。

In addition to academic qualifications, practical experience in the mining industry is also a crucial requirement for individuals seeking to become Assistant Mine Engineers. Applicants are expected to have a certain number of years working in various capacities withinthe mining sector, gaining hands-on experience in different aspects of mine operations. This practical experience is invaluable as it allows candidates to apply theoretical knowledge to real-world situations, developing problem-solving skills and a comprehensive understanding of the industry.除了学术资格,对矿业领域的实际经验也是成为煤矿助理工程师的人所必须的关键条件。

流程挖掘的组织维度I believe that one of the most crucial aspects of process mining is the organizational dimension it brings to light. 我相信流程挖掘最关键的一个方面是它揭示出的组织维度。

By analyzing the flows of activities and events within an organization, process mining can provide valuable insights into how different departments, teams, and individuals work together. 通过分析组织内部的活动和事件流程,流程挖掘可以为我们提供宝贵的洞察,了解不同部门、团队和个人是如何协同工作的。

It allows us to see the big picture of how work is being done and where bottlenecks or inefficiencies may exist. 这让我们能够看到工作是如何完成的整体情况,并找出存在瓶颈或低效的地方。

Understanding the organizational dimension of processes can help improve efficiency, reduce costs, and enhance collaboration among different parts of the organization. 理解流程的组织维度有助于提高效率、降低成本,并增强组织不同部分之间的协作。

Moreover, by uncovering the organizational dimension of processes, process mining can help identify opportunities for automation and optimization. 此外,通过揭示流程的组织维度,流程挖掘可以帮助我们找到自动化和优化的机会。

环境评价英文作文英文:Environmental assessment is an important process that evaluates the potential impacts of a project on the environment. As an environmental consultant, I have conducted numerous environmental assessments for various types of projects, such as construction, mining, and oil and gas development.One of the key aspects of an environmental assessment is the identification of potential environmental impacts. This involves a thorough analysis of the project's activities, including the construction, operation, and decommissioning phases. For example, if a mining project is proposed, we would need to assess the potential impacts on water quality, air quality, and biodiversity.Once the potential impacts are identified, we then evaluate their significance. This involves comparing thepotential impacts to relevant environmental standards and guidelines. For example, if the project is located near a sensitive ecosystem, we would need to assess whether the potential impacts would exceed the allowable limits set by the government.Based on the significance of the potential impacts, we then develop mitigation measures to reduce or eliminate them. For example, if the mining project would have significant impacts on water quality, we may recommend the implementation of a water treatment system.Overall, environmental assessment is a crucial part of responsible project development. It ensures that potential environmental impacts are identified and addressed, and that the project is developed in a sustainable manner.中文:环境评价是一项重要的过程,评估项目对环境的潜在影响。

提高网络辨别能力英语作文英文回答:Improving Cyber Discernment.In the digital age, where information is readily accessible and disseminated, it is imperative to cultivate cyber discernment. This involves the ability to critically evaluate online information, verify its authenticity, and identify potential biases or misinformation.There are several key strategies to enhance cyber discernment:1. Critical Reading and Information Validation: Thoroughly scrutinize online sources, considering factors such as the author's credibility, publication date, and the use of evidence. Use fact-checking websites and cross-reference information from multiple reputable sources to validate claims.2. Understanding Biases and Cognitive Biases: Recognize that all sources have inherent biases, whether overt or implicit. Be aware of your own cognitive biases, such as confirmation bias and the tendency to seek information that aligns with existing beliefs.3. Digital Literacy and Media Understanding: Develop a comprehensive understanding of how online media works, including the role of algorithms, targeted advertising, and social media echo chambers. This knowledge empowers individuals to navigate the digital landscape more critically.4. Seeking Diversity and Alternative Perspectives: Actively seek out diverse sources of information, including those that challenge your current views. Be open to alternative perspectives and consider different viewpoints to broaden your understanding.5. Educating Yourself and Others: Continuously educate yourself about media literacy and critical thinking. Shareyour knowledge with others, fostering a culture of cyber discernment in your community.中文回答:提高网络辨别能力。

Recent and Frequent Informative Pages from Web Logs by Weighted Association Rule Mining (IJMECS-V11-N10-5)

I.J. Modern Education and Computer Science, 2019, 10, 41-46Published Online October 2019 in MECS (/)DOI: 10.5815/ijmecs.2019.10.05Recent and Frequent Informative Pages from Web Logs by Weighted Association Rule MiningDr. SP. MalarvizhiAssociate Professor, Sri Vasavi Engineering College, Tadepalligudem, Andhra Pradesh, India.Email: spmalarvizhi1973@srivasaviengg.ac.inReceived: 17 July 2019; Accepted: 26 August 2019; Published: 08 October 2019Abstract—Web Usage Mining provides efficient ways of mining the web logs for knowing the user’s behavioral patterns. Existing literature have discussed about mining frequent pages of web logs by different means. Instead of mining all the frequently visited pages, if the criterion for mining frequent pages is based on a weighted setting then the compilation time and storage space would reduce. Hence in the proposed work, mining is performed by assigning weights to web pages based on two criteria. One is the time dwelled by a visitor on a particular page and the other is based on recent access of those pages. The proposed Weighted Window Tree (WWT) method performs Weighted Association Rule mining (WARM) for discovering the recently accessed frequent pages from web logs where the user has dwelled for more time and hence proves that these pages are more informative. WARM’s significance is in page weight assignment for targeting essential pages which has an advantage of mining lesser quality rules.Index Terms—Web logs, Web Mining, Page Weight Estimation, Weighted Minimum Support, WARM, WWT.I.I NTRODUCTIONThe Literature shows that Data Mining is a field which gains a rapid growth in recent days. Association Rule Mining (ARM) of this field plays a vital role in research [1]. Frequent itemset mining uses ARM algorithms to get the association amongst items based on user defined support and confidence [2]. Existing literature say that from the foremost frequent itemset mining algorithms like Apriori and FP-growth, many algorithms have so far been evolved. ARM gains application in business management and marketing.In case of WARM, every individual item is assigned a weight based on its importance and hence priority is given for target itemsets for selection rather than occurrence rate [3,4,5,6].Motive behind WARM is to mine lesser number of quality based rules which are more informative.Web log mining also known as web usage mining , a category of web mining is the most useful way of mining the textual log of web servers to enhance the website services. These servers carry the user’s interactions with the web [7, 8].This paper provides a method of WARM for mining the more informative pages from web logs by a technique called WWT where weight is assigned based on time dwelled by the visitor on a page and the recent access. Log is divided into n windows and weights are provided for each window. Last window is the recently accessed one and carries more weight. Along with the window weight the time dwelled on a page also adds to the priority of a target page to be mined.Rest of the paper is arranged as follows. Section 2 appraises related works of Frequent Patterns and WARM for Web log. Section 3 explains about the proposed system of how to preprocess the web logs, weight assignment techniques, WWT structure and WARM. Section 4 bears the experimental evaluation. 
Section 5 offers conclusion.II.R ELATED W ORKTo obtain the frequent pages from web logs and to provide worthy information about the users FP-growth algorithm is used [9].Web site structure and web server’s performance can be improved by mining the frequent pages of the web logs to cater the needs of the web users [10].A measure called w-support uses link based models to consider the transaction’s quality than preassigned weights [6].ARM does not take the weights of the items into consideration and assumes all items are equally important, whereas WARM reveals the importance of the items to the users by assigning a weight value to each item [11]. An efficient method is used for mining weighted association rules from large datasets in a single scan by means of a data structure called weighted tree [3].Wei Wang et al [4] proposed an effective method for WARM. In this method a numerical value is assigned for every item and mining is performed on a particular weight domain. F.Tao et al [5] discusses about weighted setting for mining in a transactional dataset and how to discover important binary relationships using WARM. Frequently visited pages that are recently used shows user’s habit and current interest. These pages may be mined by WARM techniques and can be made available in the cache of the server to make the web access speedy [12]. Here the web log is divided into several windowsand weight is assigned for each window. The latest window which carries the recently accessed pages possesses more weight.To deliver the frequent pages that acquire the interestingness of the users an enhanced algorithm is proposed by Yiling Yang et al. Content-link ratio and the inter-linked degree of the page group are the two arguments made use of in support calculation [13].Weight of each item is derived by using HITS model and WARM is performed [14]. This model produces quality oriented rules of lesser number to improve the accuracy of classification.Vinod Kumar et al [15] have contributed a strategy which aims in discovering the frequent pagesets from weblogs. It focuses on fuzzy utility based webpage sets mining from weblog databases. It involves downward closure property.K.Dharmarajan et al [16] have provided valuable information about users’ interest to obtain frequent access patterns using FP growth algorithm from the weblog data. Proposed system performs enhanced WARM for obtaining the frequent informative pages from textual web logs of servers. WWT method used in the system involves tree data structure. Database has to be divided into several windows and weights have to be assigned to the windows with the latest window having more weightage and then WWT is constructed.III. P ROPOSED S YSTEMExisting literature discusses about mining the frequent pagesets and association rules based on user defined minimum support and confidence. Proposed system aims at achieving quality rules from web logs using WARM. Weights for web pages visited by users are assigned based on factors like frequent access, time dwelled by visitors on web pages and how far the pages are recently used. Weighted Window Tree proposed here uses WARM to discover such recently used frequent pages from web log. Fig.1 shows the work flow process involved in the system.Fig.1. Work FlowA. Web Log PreprocessingWeb log is preprocessed for removing the duplicates, images, invalid and irrelevant data by cleaning. Data preprocessing is a more difficult task, but it serves reliability and data integrity necessary for frequent pattern discovery. 
Preprocessing takes about 80% of the total effort spent on mining [17]. The preprocessed web log then consists of attributes such as IP address, session ID, URLs of the pages visited by the user in that session, the requested date and time of each page, and the time dwelled on each page.

The arrangement of data in the web log resembles a relational database model. The details of the pages requested by the users are stored consecutively in the web log, and for each page requested all the attributes are updated. The log stored during a particular period, from which the frequent informative pages are to be mined, forms the back-end relational database. Table 1 shows a part of the web log of an educational institution (EDI). Every requested page is recorded as a separate entry with a distinct session ID. In the table, session ID 3256789 under IP address 71.82.20.69 has recorded three pages.

Table 1. Partial web log of the EDI data set

B. Windows and Weights

Frequent pages may have occurred regularly in the earlier stage of a particular duration of a web log rather than in the later stage. In order to distinguish the most recently accessed frequent web pages from the least recently accessed ones, the WWT arrangement is introduced. The log for a particular duration is first divided into N windows, preferably of equal size. While splitting windows, care should be taken that pages of the same session ID are not divided into two different windows but belong to the same window. To achieve this, the windows are formed by dividing the total number of sessions in the log by the number of windows needed. This gives an equal number of sessions S per window, except sometimes for the latest window, which carries the remainder of the division.

Windows are given index numbers from bottom to top as 1 to N. The last window at the bottom of the log, consisting of the most recently accessed web pages, has index 1 and is given the highest weight (hWeight), lying between 0 and 1, i.e. 0 < hWeight < 1. Window 2, just before window 1 from the bottom, has a weight lower than that of window 1. Equation (1) provides the weight of each window. From equation (1), if window index i is greater than window index j, then weight_i < weight_j; that is, as the window index increases, the window weight decreases [12]. Sessions of each window are renumbered separately from 1 to S, as shown in Table 3.

    weight_i = (hWeight)^i                                        (1)

where i is the window index from 1 to N. After assigning weights to the windows, each individual page URL in a window is assigned a weight equal to the weight of that window. Hence a particular URL lying in the last window gains more weight than the same URL lying in other windows, which reflects the importance of the pages recently accessed.

The same URL appearing in different windows may have different page dwelling times. The total weight of every individual URL is then found, based on the occurrence rate of the page, by using equation (2), which adds up the products of the window weight and the dwelling time T for every occurrence of that URL across the windows. If the URL occurs in the j-th session of the i-th window, then

    W_URL = Σ_{i=1..N} Σ_{j=1..S} (weight_i · T_ij)               (2)

In equation (2), N is the total number of windows, S is the total number of sessions in the i-th window, weight_i is the weight of the i-th window found by equation (1), which assigns the same weight to all the URLs in that window irrespective of sessions, and T_ij is the total dwelling time of the page in the j-th session of the i-th window. The number of (weight_i · T_ij) products summed is the number of occurrences of the particular page URL.
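As an illustration of equations (1) and (2), the sketch below assigns window weights and accumulates total URL weights from a flat list of log records. The record format (window index, session ID, URL, dwelling time) and the function names are assumptions made for the example; they are not the paper's actual log layout.

from collections import defaultdict

def window_weights(n_windows, h_weight=0.5):
    """Equation (1): weight_i = hWeight**i, window 1 being the most recent."""
    return {i: h_weight ** i for i in range(1, n_windows + 1)}

def url_weights(records, n_windows, h_weight=0.5):
    """Equation (2): W_URL = sum over occurrences of (window weight * dwell time).

    `records` is assumed to be an iterable of tuples
    (window_index, session_id, url, dwell_time), with window index 1 for the
    latest window. Returns total weight and visitor sessions per URL.
    """
    w = window_weights(n_windows, h_weight)
    totals = defaultdict(float)
    sessions = defaultdict(set)
    for win, sid, url, dwell in records:
        totals[url] += w[win] * dwell
        sessions[url].add((win, sid))      # sessions are renumbered per window
    return totals, sessions

# Example consistent with the sample relation (two windows, hWeight = 0.5):
# a page dwelled 4 time units in window 1 and 3 time units in window 2
# contributes 0.5*4 + 0.25*3 = 2.75 to its total URL weight.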
To calculate the average page weight, or pageset (set of pages) weight, equation (3) is used.

    W_ps = (1/n_ps) · Σ_{k=1..m} W_URL_k                          (3)

In equation (3), W_ps is the weight of a pageset ps; it is the ratio of the sum of the total weights of the individual pages belonging to that pageset to the number of visitors of that pageset, n_ps (the number of session IDs in which the pageset occurs), and m is the number of pages in the pageset. W_URL is found using equation (2) for those records of ps having non-zero entries for all the pages of the pageset. For a 1-pageset, n_ps = n_p and W_ps = W_p, i.e. the individual page weight, and hence m = 1.

C. Weighted Window Tree (WWT) and WARM

The WWT method consists of the following steps: (i) constructing the WWT; (ii) compressing the tree; (iii) mining recently and frequently visited pagesets; (iv) WARM.

Table 2 shows a partial sample data log with the dwelling time of the pages for a few session IDs. The log is divided into windows and window weights are allotted. A relational database, as shown in Table 3, is constructed for the data in Table 2, which aids in constructing the tree. Pages are numbered as p1, p2, p3, ... from the last record of the last window (index 1) of the web log towards the top window of the log considered for a duration. When a URL is repeated it is given the same page number.

Table 2. Partial sample data log
Table 3. Sample relational database

The entries under the pages of the sample relation are the product of the dwelling time of the particular page by the particular visitor and the weight of the window in which the page lies. The lowest window is the latest window and is given index 1. The weight of the first window is assumed to be 0.5 (between 0 and 1). Using equation (1), the weight of the 2nd window is calculated as 0.25.

1) Constructing the Weighted Window Tree

With one scan of the database, with the windows and weights assigned, the Weighted Window Tree shown in Fig. 2 can be constructed. Ellipses of the tree are called page nodes, which contain the page URL, and rectangles are called SID (session ID) nodes, which contain the SID and the weight. There are two pointers for every page node: one pointer is directed towards the next page node and the other towards the successive SID nodes containing that page URL [3].

Fig. 2. WWT for the sample database

Once the tree is constructed, it has to be compressed to prepare it for mining with the user-defined weighted minimum support (w_ms) and thereby obtain the recent and frequent informative pages.

2) Tree compression

Frequent pagesets are mined based on the user-defined weighted minimum support W_ms. Let W_ms = 1.5 for the sample relation. If an individual page weight W_p ≥ W_ms, then the page is said to be frequent. By equations (2) and (3) the average page weights of the individual pages are calculated as shown below.

    W_p1 = (2 + 1 + 3)/3 = 2
    W_p2 = (0.75 + 0.5 + 1 + 3.5)/4 = 1.44
    W_p3 = (1 + 1 + 2.5)/3 = 1.5
    W_p4 = (7.5)/1 = 7.5

Pages p1, p3 and p4 alone are frequent-1 pages, as their weights are not less than W_ms. Hence the page node p2 and its attached SID nodes are removed from the tree, since it is infrequent, and the pointer is made to point directly from p1 to p3. The tree obtained after removing the infrequent page nodes and the relevant branches of weight nodes is called the compressed tree.
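The pageset weight of equation (3) and the frequency check can be sketched as follows. The rows below are a small table consistent with the worked numbers quoted in the text (entries are already weight × dwelling time); the exact row layout of Table 3 is not visible in this extract, so this arrangement, and the helper names, are assumptions.

def pageset_weight(table, pageset):
    """Equation (3): W_ps = (1/n_ps) * sum of the member pages' entries over the
    rows (sessions) in which every page of the pageset has a non-zero entry."""
    rows = [r for r in table if all(r.get(p, 0) > 0 for p in pageset)]
    if not rows:
        return 0.0, 0
    total = sum(r[p] for r in rows for p in pageset)
    return total / len(rows), len(rows)

def is_frequent(table, pageset, w_ms=1.5):
    """A pageset is frequent if its weight and, on the same supporting rows,
    each member page's weight are at least w_ms (downward-closure style check)."""
    rows = [r for r in table if all(r.get(p, 0) > 0 for p in pageset)]
    if not rows:
        return False
    w_ps, _ = pageset_weight(table, pageset)
    members_ok = all(sum(r[p] for r in rows) / len(rows) >= w_ms for p in pageset)
    return w_ps >= w_ms and members_ok

table = [
    {"p1": 2,   "p2": 0.75},
    {"p2": 0.5, "p3": 1},
    {"p1": 1,   "p2": 1,   "p3": 1},
    {"p1": 3,   "p2": 3.5, "p3": 2.5},
    {"p4": 7.5},
]

print(pageset_weight(table, ("p1",)))        # (2.0, 3)    -> frequent-1 page
print(pageset_weight(table, ("p2",)))        # (1.4375, 4) -> infrequent
print(pageset_weight(table, ("p1", "p3")))   # (3.75, 2)
print(is_frequent(table, ("p1", "p3")))      # True

The printed values match the worked example in the text: W_p1 = 2, W_p2 = 1.44, W_p1p3 = 3.75, and {p1, p3} passes the member-page check (W_p1 = 2 and W_p3 = 1.75 on the two supporting rows).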
3) Mining recently and frequently visited pagesets

The WWT method covers both the user's habits and interests. A habit is what is regularly used (frequently used pages) and an interest is what changes over time (recently used pages). A high w_ms leads to mining a smaller number of more recently used frequent pages, which most likely reflects the users' current interests, while a low w_ms leads to mining a larger number of less recently used frequent pages, which reflects the users' regular habits. The more recently used pages can be kept in the server's cache to speed up web access.

In the literature, WARM does not satisfy the downward closure property: it is not necessary that all the subsets of a frequent pageset are frequent, because frequent pagesets are defined through weighted support. The proposed WWT method, however, overrides this. For an m-pageset to be frequent, all the individual pages, over the non-zero records of the m-pageset, have to be individually frequent. All the frequent-1 pages are put in a set named {F1} and the non-empty subsets of it are obtained, excluding the one-element subsets and the full set. For our example, {F1} = {p1, p3, p4} and such subsets are {p1, p3}, {p3, p4} and {p1, p4}. The pageset weight W_ps of all the subsets obtained from {F1} is calculated using equation (3) and each is checked for being frequent or not. For example, for ps = {p1, p3}, W_p1p3 = (1/n_p1p3)(W_p1 + W_p3), where n_p1p3 is the number of visitors who have visited both p1 and p3. It is found by the intersection of the session IDs of the visitors of p1 and p3: {1, 3, 4} ∩ {2, 3, 4} gives {3, 4}, i.e. only the records with non-zero entries for all the pages of the pageset {p1, p3}. Two visitors (the 3rd and 4th) have visited both p1 and p3, out of a total of five visitors. Therefore W_p1p3 = (1 + 1 + 3 + 2.5)/2 = 3.75 > 1.5. The individual weights of p1 and p3 over the 3rd and 4th visitors' records are W_p1 = (1 + 3)/2 = 2 > 1.5 and W_p3 = (1 + 2.5)/2 = 1.75 > 1.5. Hence the pageset {p1, p3} is a frequent pageset according to the downward closure property. Similarly, all the other subsets have to be checked.

4) WARM

Weighted association rules are the strong rules obtained from the frequent pagesets by prescribing a user-defined minimum weighted confidence C_min. Let x be one of the frequent pagesets and s a subset of x; if W_x / W_s ≥ C_min, then a rule of the form s => x - s is obtained. Consider, for example, the frequent pageset x = {p1, p3} and s = {p1}, a subset of x. If C_min = 1.5, then since W_p1p3 / W_p1 = 3.75/2 = 1.875 > 1.5, p1 => p3 is a strong association rule.

IV. EXPERIMENTAL EVALUATION

To evaluate the performance of the proposed method, experiments were performed on two different datasets: an EDI (educational institution) dataset and the msnbc dataset available in the UCI machine learning repository from Internet Information Server (IIS) logs for . Comparison is made between the Weighted Tree [3] and the proposed WWT methods in terms of speed and space for various W_ms. Both datasets were extracted for one full day. Speed is calculated in terms of CPU execution time by including stubs in the program. Experiments were performed on an Intel Core i5, 3.2 GHz machine with 2 GB RAM and a 500 GB hard disk running Windows XP. The WWT algorithm is implemented in Java.

The proposed WWT method shows enhanced performance, with less execution time and less space than the Weighted Tree method; the experimental results are presented below, from Figure 3 to Figure 6, for both datasets. It produces a smaller number of frequent pagesets and hence takes less time (speed) and space.
Comparison of the results for execution time (Speed) of both Weighted Tree (WT) and Weighted Window Tree (WWT) methods are shown in fig.3 and fig.4 and that for comparison of number of pagesets (Space) generated by both the methods are given in fig.5 and fig.6. It is seen that, as w ms increases the execution time and spacereduces more for WWT than for WT. WWT is also well scalable when input data set size increases.Fig.3. Comparison of Execution time for WT and WWT for msnbcdatasetFig.4. Comparison of Execution time for WT and WWT for EDI datasetFig.5. Comparison of Number of pagesets generated by WT and WWTfor msnbc datasetFig.6. Comparison of Number of pagesets generated by WT and WWTfor EDI datasetFig.3 implies that WWT has its ececution time reduced nearly by 50% than WT for msnbc dataset. Fig.4 implies that the execution time of WWT for EDI dataset has reduced one fourth than WT. This clearly reflects that the removal of insignificant pagesets by WWT reduces the execution time. Fig.5 and fig.6. reveal that number of frequent pagesets mined using WWT considerably reduces because of window weightage technique.V. C ONCLUSIONMethod for mining recent and frequent informative pages from web logs based on window weights and dwelling time of the pages is discussed. This utilizes Weighted Window Tree arrangement for Weighted Association Rule Mining and it is seen more efficient than Weighted Tree from experimental evaluation by means of speed and space. The system covers less recently used frequent pages from earlier stages and more recently used frequent pages from later stages.This method finds its main application in mining the web logs of educational institutions, to observe the surfing behaviour of the students. By mining the frequent pages using WWT, most significant pages and websites can be identified and useful informative pages can be accounted for taking future decisions. This technique can be used for mining any organizational weblog to know the employees browsing behaviour.The limitation of WWT lies in availing at a better weight allotting scheme which can reduce the execution time and number of pages mined even more better than WWT.R EFERENCES[1] Qiankun Zhao, Sourav S. Bhowmic, “Association R uleMining: A Survey” Technical Report, CAIS, Nanyang Technological University, Singapore, No. 2003116 , 2003. [2] Han. J and M. Kamber(2004), “Data Mining Concepts andTechniques”: San Francisco, CA:. Morgan Kaufmann Publishers.[3] Preetham kumar and Ananthanara yana V S, “Discovery ofWeighted Association Rules Mining”, 978-1-4244-5586-7/10/$26.00 C 2010 IEEE, volume 5, pp.718 to 722.[4] W. Wang, J. Yang and P. Yu, “Efficient mining ofweighted association rules (WAR)”, Proc. of the ACM SIGKDD Conf. on Knowledge Discovery and Data Mining, 270-274, 2000.[5] F.Tao, F.Murtagh, M .Farid, “Weighted Association RuleMining using Weighted Support and Significance framework”, SIGKDD 2003.[6] Ke Sun and Fengshan Bai, “Mining weighted AssociationRules without Preassigned Weights”, IEEE Transactions on Knowledge and Data Engineering, Vol. 20, No. 4, pp.489-495, April 2008.[7] Hengshan Wang, Cheng Yang and Hua Zeng, “Design andImplementation of a Web Usage Mining Model Based on Fpgrowth and Prefixspan”, Communications of the IIMA, 2006 Volume 6 Issue2, pp.71 to 86.[8] V.Chitraa and Dr. Antony Selvadoss Davamani, “A Surveyon Preprocessing Methods for web Usage Data”, (IJCSIS)International Journal of Computer Science and Information Security,Vol. 7, No. 
3, pp.78-83, 2010.[9]Rahul Mishra and Abha Choubey, “ Discovery of FrequentPatterns from web log Data by using FP Growth algorithm for web Usage Mining”, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 2, Issue 9, pp.311-318, Sep 2012.[10]Renáta Iváncsy and István Vajk, “Frequent Pattern Miningin web Log Data”, Acta Polytechnica Hungarica Vol. 3, No.1, 77-90, 2006.[11]P.Velvadivu and Dr.K.Duraisamy, “An OptimizedWeighted Association Rule Mining on Dynamic Content”, IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 2, No 5, pp.16-19, March 2010.[12]Abhinav Srivastava, Abhijit Bhosale, and Shamik Sural,“Speeding Up web Access Using Weighted Association Rules”, S.K. Pal et al. (Eds.): PReMI 2005, LNCS 3776, pp. 660–665, 2005. _c Springer-Verlag Berlin Heidelberg 2005.[13]Yiling Yang, Xudong Guan, Jinyuan You,”EnhancedAlgorithm for Mining the Frequently Visited Page Groups”, Shanghai Jiaotong University, China.[14]S.P.Syed Ibrahim and K.R.Chandran, “compact weightedclass association rule mi ning using information gain”, International Journal of Data Mining & knowledge Management Process (IJDKP) Vol.1, No.6, November 2011.[15]Vinod Kumar and Ramjeevan Singh Thakur, “High FuzzyUtility Strategy based Webpage Sets Mining from Weblog Database”, International Journal of Intelligent Engineering and Systems, Vol.11, No.1, pp.191-200, 2018. [16]K.Dharmarajan and Dr.M.A.Dorairangaswamy, “WebUsage Mining: Improve the User Navigation Pattern using FP-Growth algorithm”, Elysium Journal of Engineering Research and Management, Vol.3, Issue 4, August 2016.[17]Liu Kewen, “Analysis of preprocessing methods for webUsage Data”, 2012 International conference on measurement, Information and Control (MIC), School of Computer and Information Engineering, Harbin University of Commerce, China.Author’s ProfileDr.SP.Malarvizhi received the BEdegree in Electrical and ElectronicsEngineering from Annamalai University,India in 1994, ME degree in ComputerScience and Engineering from AnnaUniversity of Technology, Coimbatore,India in 2009 and received the Ph.D.degree in Data Mining from Anna University Chennai in 2016. She is currently working as an Associate Professor in CSE department at Sri Vasavi Engineering College, Andhra Pradesh since 2017. She has participated and published papers in many National and Internal Conferences and also published 7 papers in National and International journals. Her research interests are Data Mining, Big Data and Machine Learning.How to cite this paper: SP. Malarvizhi, " Recent and Frequent Informative Pages from Web Logs by Weighted Association Rule Mining", International Journal of Modern Education and Computer Science(IJMECS), Vol.11, No.10, pp. 41-46, 2019.DOI: 10.5815/ijmecs.2019.10.05。

An Average Linear Time Algorithm for WebUsage MiningJos´e BorgesSchool of Engineering,University of PortoR.Dr.Roberto Frias,4200-Porto,Portugaljlborges@fe.up.ptMark LeveneSchool of Computer Science and Information SystemsBirkbeck,University of LondonMalet Street,London WC1E7HX,U.K.mark@October22,2003AbstractIn this paper we study the complexity of a data mining algorithm for extracting patterns from user web navigation data that was proposed in previous work[3].The user web navigation sessions are inferred from log data and modeled as a Markov chain.The chain’s higher probability1trails correspond to the preferred trails on the web site.The algorithmimplements a depth-first search that scans the Markov chain for the highprobability trails.We show that the average behaviour of the algorithmis linear time in the number of web pages accessed.Keywords.Web usage mining,Markov chains,analysis of algorithms1IntroductionWeb usage mining is defined as the application of data mining techniques to discover user web navigation patterns in web log data[15].Logfiles provide a list of the page requests made to a given web server in which a request is characterised by,at least,the IP address of the machine placing the request, the date and time of the request,and the URL of the page requested.From this information it is possible to reconstruct the user navigation sessions within the web site[1],where a session consists of a sequence of web pages viewed by a user in a given time window.The web site owner can take advantage of web usage mining techniques to gain insights about the user behaviour when visiting the site and use the acquired knowledge to improve the design of the site.Two distinct directions are,in general,considered in web usage mining re-search.In thefirst,the user sessions are mapped onto relational tables and an adapted version of standard data mining techniques,such as mining association rules,is invoked,see for example[11].In the second approach,in which we situate our research,techniques are developed which can be invoked directly on the log data,see for example[3]or[14].2In[14]the authors propose a novel data mining model specific to analyze log data.A log data mining system is devised tofind patterns with predefined characteristics by means of a query language to be used by an expert.Several authors have proposed the use of Markov models to model user requests on the web.Pitkow et al.[12]proposed a longest subsequence model as an alterna-tive to the Markov model.Sarukkai[13]uses Markov models for predicting the next page accessed by the user.Cadez et al.[6]use Markov models for classifying browsing sessions into different categories.Deshpande et al.[7]pro-pose techniques for combining different order Markov models to obtain low state complexity and improved accuracy.Finally,Dongshan and Junyi[8]proposed an hybrid-order tree-like Markov model to predict web page access which pro-vides good scalability and high coverage.Markov models have been shown to be suited to model a collection of navigation records,where higher order models present increased accuracy but with a much larger number of states.In previous work we have proposed to model a collection of user web nav-igation sessions as a Hypertext Probabilistic Grammar(HPG).The HPG is a self-contained and compact model which is based on the well established theory of probabilistic grammars providing it with a sound foundation for future en-hancements such as the study of its statistical properties.In[3]we proposed an algorithm to 
extract the higher probability trails which correspond to the users preferred navigational trails.In this paper we provide a formal analysis of the algorithm’s complexity, which is afirst step in the direction of making the various web usage mining3models proposed in the literature comparable,since the diverse characteristics of the patterns induced by the various models makes such comparison difficult.The HPG model can alternatively be seen as an absorbing Markov chain[9]. Since the HPG concept is not essential for the algorithm’s complexity analysis, herein,for simplicity,we will treat our model in terms of an absorbing Markov chain.For details on the HPG model we refer the reader to[2,3,4,5].In Section2we briefly present the proposed model and the depth-first algo-rithm forfinding patterns in the data.In Section3we present the main result of this paper proving that the algorithm has average linear time complexity.2Markov Chains to Model User Navigation Ses-sions2.1Building the Model from the Navigation SessionsIn this section we will introduce the proposed Markov model by means of an example.The model is inferred from the collection of sessions.Consider a web site with three pages,{A,B,C},and the following collection of user navigation sessions:(1)A,B,C,B(2)A,C,B,C,C(3)B,C,B(4)C,A,B.Each navigation session gives a sequence of pages viewed by a user.To each web page visited we create a corresponding state in the model.Moreover,there are two additional states:a start state S representing the first state of every navigation session and afinal state F representing the last4state of every navigation session.Figure1(a)shows the Markov chain modelling just thefirst session.There is a transition corresponding to each sequence of two pages in the session,a transition from the start state to thefirst state of the session,and a transition from the last state of the session to thefinal state.The probability of a transition is estimated by the ratio of the number of times the corresponding sequence of pages was traversed and the number of times the anchor page was visited.The model is incrementally build by processing the complete set of naviga-tion sessions.Figure1(b)shows the model for the entire collection of sessions given in the example.(a)(b)Figure1:The process of building the Markov model.Figure2shows the Markov chain corresponding to the example represented in Figure1(b).The Markov chain is defined by a set of states X,a transition matrix T,and a vector of initial probabilitiesπ.The set of states,X,is com-posed by the start state,S,thefinal state,F,and the states that correspond5to the web pages visited.The transition matrix records the transition probabil-ities which are estimated by the proportion of times the corresponding link was traversed from the anchor.The initial probability of a state is estimated as the proportion of times the corresponding page was requested by the user.Therefore,according to the model’s definition,apart from thefinal state,all states have a positive initial probability.In[3]we make use of a parameter by which we can tune the model to be between a scenario where the initial probabilities are proportional to the number of times a page has been requested as thefirst page and a scenario where the probabilities are proportional to the number of times the page has been requested.For this paper we have adopted the latter scenario which enables us to identify sequences of pages which were frequently followed but were not necessarily at the beginning of a user navigation 
session.From the method given to infer the model we note that every state in X(i.e., a state present in at least one navigation session)is included in at least one path from state S to state F.Since thefinal state F does not have out-transitions and it is reachable from every other state,the state F is an absorbing state and, therefore,the Markov chain is an absorbing chain.As described,the model assumes that the probability of a hypertext link being chosen depends solely on the contents of the page being viewed.Several authors have shown that models that make use of such an assumption are able to provide good accuracy when predicting the next link the user will choose to follow,see for example[13].In addition,this assumption can be relaxed by6making use of the N gram concept[3],or the dynamic Markov chain concept [10].Since the application of the two referred concepts results in a model with the same properties as the model described herein we refer the reader to[3]and [10]for more detail.X={A,B,C,F}π=<3/15,6/15,6/15,0>T=A B C FA02/31/30B001/21/2C1/63/61/61/6F0001Figure2:The Markov chain corresponding to the example.The Markov chain inferred from the log data summarises the user interaction with the web site and the aim is to identify patterns in the navigation behaviour. Definition1(Trail)We define a trail as afinite sequence of states that are accessed in the order of their traversal in the underlying web site.According to the model proposed,the probability of a trail is estimated by the product of the initial probability of thefirst state in the trail and the transition probabilities of the enclosed transitions.For example,the estimated probability of trail A,B,C,F is3/15·2/3·1/2·1/6=6/540.Note that,a trail induced by the model does not necessarily have to end in the state F. 
The probability estimated for trail A,B,C,which is3/15·2/3·1/2=6/90, gives the probability of A,B,C as a prefix of other trails,and the probability estimated for A,B,C,F gives the probability of a user terminating the session7after following the trail A,B,C.Definition2(Navigation patterns)A set of navigation patterns is defined to be the set of trails whose estimated probability is above a specified cut-point,λ∈(0,1).We define the cut-point,λ,to be composed of two distinct thresholds,with λ=θδ,whereθ∈(0,1)is the support threshold andδ∈(0,1)the confidence threshold.This decomposition is adopted in order to facilitate the specification of the cut-point value.In fact,since the model assumes that every state has a positive initial probability,the values of the probabilities in the vector of initial probabilities are of a much smaller order of magnitude than the values of the transition probabilities.Therefore,when setting the cut-point value we recommend the analyst to view the support threshold as the factor responsible for pruning out the states whose initial probability is low,corresponding to a subset of the web site rarely visited.Similarly,we recommend to view the confidence as the factor responsible for pruning out trails containing transitions with low probability.One difficulty that arises with models such as the one described herein is how to set the value of the parameters.The idea behind decomposing the cut-point into two components is to provide the analyst with some insight on how to set the parameter’s value.For example,if the support threshold is set in a way that takes into account the number of states in the model,i.e.a model with n states having the support set toθ=1/n,it would mean that only pages which were visited a number of times above the average will be considered as being8initial states of a trail.Similarly,in order to set the value of the confidence threshold the analyst could take into account the web site branching factor,that is,the average num-ber of out-links per page.For example,if the model has on average t out-links per page the average transition probability is1/t.Therefore,if the analyst aims to identify trails composed by transitions whose estimated probability is greater than1/t and with length m,the confidence threshold should be set to δ=(1/t)(m−1).The factor(m−1)is necessary because thefirst state in a trail is obtained by a transition from the start state which is taken into account by the support threshold.Two other algorithms that make use of other cut-point definitions in order to identify different types of patterns were proposed in[4] and[5].Assuming a support threshold of1/n,in the context of a set of con-trolled experiments,it is possible to vary the overall value of the cut-point by varying its confidence component.2.2The AlgorithmThe algorithm proposed forfinding the set of all trails with probability above a specified cut-point consists of a generalisation of a depth-first search,[16].An exploration tree is built with the start state as its root wherein each branch of the tree is explored until its probability falls below the cut-point.Definition3(Rule)A branch with probability above the cut-point and with no extensions leading to a longer branch with probability above the cut-point is called a rule.9While a branch in the tree corresponds to a trail users may follow in the web site,a rule corresponds to a trail that has,according to past behaviour,high probability of being followed by the users.Note that we only include maximal trails in 
the induced rule-set, RS, where a trail is maximal if it is not a proper prefix of any other trail in the set. All non-maximal trails that have probability above the cut-point are implicitly given by one, or more than one, maximal trail. For example, if a trail X,Y,Z is maximal (i.e., cannot be augmented), the non-maximal trail X,Y also has its probability above the cut-point; however, it will not be included in the rule-set because it is implicitly given by the corresponding maximal trail.

We now give the pseudo-code for our depth-first search algorithm. We let X be the set of states, |X| be its cardinality, and X_i, 1 ≤ i ≤ |X|, represent a state. Moreover, wX_i represents a trail being evaluated. For a trail composed of m states, w represents the prefix-trail composed of the first m−1 states and X_i represents the m-th state, which is the tip of the trail being evaluated. We let p(wX_i) represent a trail's probability. The concatenation operation, wX_i + X_j, appends the state X_j to the trail wX_i, resulting in the trail wX_iX_j with X_j being its tip. Also, T_i,j represents the probability of a transition from state X_i to state X_j, π_i the initial probability of state X_i, and RS represents the set of rules being induced. The transition matrix is implemented as a linked list in such a way that each state has its out-transitions represented by a list of the states which, according to the user's navigation records, are reachable from it in one step. Links that were not traversed and have an estimated probability of 0 are not kept in the list. The notation X_i→lst represents the access to the first state in the list of states that have a transition from state i, ptr represents a pointer to the state in the list that is currently being handled, and ptr = ptr→next assigns to ptr the next state in the list. Finally, X_ptr represents the state currently indicated by the pointer.

Algorithm 1.1 (DFSmining(λ))
1. begin
2.   for i = 1 to |X|
3.     if π_i > λ then Explore(X_i, π_i);
4.   end for
5. end.

Algorithm 1.2 (Explore(wX_i, p(wX_i)))
1.  begin
2.    flag = false;
3.    ptr = X_i→lst;
4.    while (ptr != null)
5.      if p(wX_i)·T_i,ptr > λ then
6.        Explore(wX_i + X_ptr, p(wX_i)·T_i,ptr);
7.        flag = true;
8.      end if
9.      ptr = ptr→next;
10.   end while
11.   if flag = false then RS = RS ∪ {wX_i};
12. end.

Figure 3 illustrates the exploration tree induced when the algorithm, with the cut-point set to λ = 0.11, is applied to the example in Figure 2. A plain line indicates the composition of the maximal trails and a dashed line corresponds to a transition whose inclusion in the trail being explored would lead to a trail with probability below the cut-point.

Figure 3: The exploration tree induced by the algorithm when applied to the example of Figure 2 with λ = 0.11.
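For readers who prefer an executable form, a compact sketch of Algorithms 1.1 and 1.2 is given below. It uses plain dictionaries instead of the paper's linked-list representation, and it assumes the absorbing state F is stored with an empty out-transition list; the function and variable names are ours. Running it on the example chain of Figure 2 with λ = 0.11 yields the maximal trails shown in the usage lines below.

def dfs_mining(states, pi, T, cut_point):
    """Sketch of Algorithms 1.1/1.2: `pi` maps state -> initial probability and
    `T` maps state -> {successor: transition probability}, keeping only
    traversed links (probability-0 links are simply absent)."""
    rules = []

    def explore(trail, prob):
        extended = False
        for nxt, p in T.get(trail[-1], {}).items():
            if prob * p > cut_point:
                explore(trail + [nxt], prob * p)
                extended = True
        if not extended:
            rules.append((trail, prob))

    for s in states:
        if pi[s] > cut_point:
            explore([s], pi[s])
    return rules

# The example chain of Figure 2 (state S omitted; its role is captured by pi):
pi = {"A": 3/15, "B": 6/15, "C": 6/15, "F": 0}
T = {
    "A": {"B": 2/3, "C": 1/3},
    "B": {"C": 1/2, "F": 1/2},
    "C": {"A": 1/6, "B": 3/6, "C": 1/6, "F": 1/6},
    "F": {},
}
for trail, p in dfs_mining(["A", "B", "C", "F"], pi, T, cut_point=0.11):
    print(trail, round(p, 3))
# Yields the maximal trails A,B / B,C / B,F / C,B, each with probability above the cut-point.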
component.We define an operation as the evaluation of a link when constructing the exploration tree.For each configuration of the model 30runs were performed and the results shown correspond to the average number of operations for 30runs.We note that similar behaviour was observed on real data sets [3].These results suggest that the algorithm has average linear time behaviour.02040608010012014016005000100001500020000N u m . o p e r a t i o n s (x 1000)Num. states DFSmining algorithm (BF=5)Conf.=0.3Conf.=0.5Conf.=0.7Figure 4:Variation of the number of operations with the model’s number of states.133Analysis of the Algorithm’s Average Complex-ityWe now give our analysis of the average case complexity for the proposed al-gorithm.We measure the algorithm’s complexity by the number of operations, where an operation is defined as a link evaluation in the exploration tree.Consider a Markov chain with n=|X|states,t transitions between states having probability greater than0,and cut-pointλ.The model’s branching factor,BF,is the average number of out-transitions per state and is given by t/n.Also,we define the length of a trail as the number of states it contains.Given a model,the average number of trails starting in a particular state and having a given length,∆,corresponds to the number of distinct branches in a BF-ary tree whose depth is equal to∆.Thus,since the navigation can begin at any of the n states,we can estimate the average number of trails having a given length,∆,denoted by E(#w∆),by the following expression:E(#w∆)=n·BF(∆−1)=ntn(∆−1).Given that every state has positive initial probability the average initial probability of a state is1/n.Similarly,the average transition probability be-tween two states is1/BF=n/t.Therefore,the average probability of a trail with length∆,E(p(w∆)),will beE(p(w∆))=1nnt(∆−1).By making E(p(w∆))=λwe can estimate the average trail length for the given cut-point value,∆λ,as14∆λ=ln(λn)ln(n t)+1.Note that the algorithm augments trails until their probability fall below the cut-point,therefore,on average the probability of the maximal trails inferred is close toλ.Given thatλ=θδand assumingθ=1/n we have∆λ=ln(δ)−ln(BF)+1.(1)This last expression shows that for a given confidence threshold and branch-ing factor,the estimated average rule length is constant and independent of the number of states.There is a special case to consider when BF=1,which occurs if all existing trails from S to F are disjoint and with no recurrent states. 
There is a special case to consider when BF = 1, which occurs if all existing trails from S to F are disjoint and with no recurrent states. In this case, E(p(w_∆)) = 1/n and ∆_λ is given by the average length of all the trails from S to F. In general, since every state has at least one out-transition, BF ≥ 1; therefore, if θ = 1/n it follows that ∆_λ ≥ 1.

In addition, for a given δ the average number of trails with length ∆_λ is

E(#w_∆λ) = n (t/n)^(∆_λ−1) = n (t/n)^(ln(δ) / (−ln(t/n))) = n/δ,    (2)

since

(t/n)^(ln(δ) / (−ln(t/n))) = e^(−ln(δ)) = 1/δ.

Expression (2) gives the average number of rules for a given cut-point. Intuitively, (2) follows from the fact that 1/δ gives an upper bound on the number of trails we can pack into δ for each of the n states.

We will now determine the average number of operations necessary to induce the rule-set RS for a given cut-point. As before we define an operation as the evaluation of a link. The for loop in Algorithm 1.1 performs one operation per state, in a total of n operations. For each state Algorithm 1.2 is invoked. In Algorithm 1.2 there is a while loop that recursively calls the algorithm from line (6). In the worst-case analysis, each recursive call of the algorithm evaluates all the out-transitions from the tip of the trail and, therefore, for each invocation of the while loop the algorithm performs on average BF operations. Moreover, in order to induce a rule-set for a Markov chain with n states, each of the n states needs to have its out-transitions evaluated; therefore, the average rule length, ∆_λ, corresponds to the depth of recursion. Finally, the average number of operations performed, denoted by E(#O_n), is bounded below and above by

1/δ ≤ E(#O_n)/n ≤ Σ_{i=0}^{⌈∆_λ⌉} BF^i = (1 − BF^(⌈∆_λ⌉+1)) / (1 − BF),    (3)

where ⌈∆_λ⌉ is the ceiling of ∆_λ. We are now ready to present the analysis of the average complexity of the algorithm.

Theorem. Given a fixed support, θ = O(1/n), and confidence, δ, the average number of operations needed to find all the trails having probability above the cut-point, λ = θδ, varies linearly with the number of states in the absorbing Markov chain.

Proof. Consider a Markov chain with n states, t transitions between states, λ = θδ and assume θ = 1/n. From their definitions, δ and BF are independent of n. Also, it follows from (1) that ∆_λ is independent of n. Therefore, it follows from (3) that the number of operations depends linearly on n. □

Following the analysis, we can state that the worst-case complexity of the algorithm occurs when the average trail length is maximal. Assuming that T_max is the maximal probability of a transition in the Markov chain, for T_max < 1 we can derive the maximum for the average trail length as

∆_Tmax = ln(λn) / ln(T_max) + 1.

To obtain the number of operations corresponding to the worst case, just replace ∆_λ by ∆_Tmax in Equation (3).

We now illustrate the result by means of an example. Consider a model having n = 5, BF = 2 and λ = θδ = 0.05, where δ is taken to be 0.25 and θ to be 1/5. The average case of such a model consists of a Markov chain in which every state has exactly two out-transitions and every transition has probability 0.5. Thus, the estimate for the average trail length is given by

∆_λ = ln(δ) / (−ln(BF)) + 1 = 3.

To induce the rule-set the algorithm constructs the exploration tree represented in Figure 5. In the figure, plain lines indicate links that are part of trails with probability above the cut-point (i.e. rules), dotted lines indicate links whose inclusion in the trail being evaluated would lead to a trail with probability below the cut-point, and the lines represented by dots and dashes represent an exploration tree similar to the one detailed. Finally, the numbers next to the links indicate the order in which the links are evaluated.

Figure 5 indicates the number of operations (link evaluations) performed by the algorithm in order to induce the set of maximal trails. The first link to be evaluated is the transition from the start state to state n1, which has probability 1/5. Then, transitions are recursively evaluated in a depth-first scheme until the trail's probability falls below the threshold. Since each state has two out-transitions with equal probability, the induced trails are composed of just two transitions. Therefore, in order to induce the maximal trails that start from state n1 we need to perform 15 operations, and 75 operations are needed to induce the complete set of maximal trails.

Figure 5: Example of the exploration tree resulting from the algorithm.

The average case analysis for the example gives

E(#O_n) ≤ n (1 − BF^(⌈∆_λ⌉+1)) / (1 − BF) = 75,

which corresponds to five times the sub-tree detailed in Figure 5. Finally, the worst-case analysis depends on the T_max value, as shown in Table 1.

Table 1: The variation of the worst-case analysis with the T_max value.

T_max       0.6    0.7    0.8     0.9       1
∆_Tmax      3.7    4.9    7.2     14.2      ∞
E(#O_n)     155    315    2555    327675    ∞
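The numbers in the example and in Table 1 can be reproduced mechanically. The short script below is an illustrative check, not part of the original paper; it recomputes ∆_λ, the 75-operation bound for the n = 5, BF = 2 example, and the worst-case row of Table 1 for each finite T_max.

```python
# Sketch: reproducing the example's operation counts and Table 1 (worst case).
import math

def upper_bound_ops(n, bf, depth):
    """n * (1 - BF^(ceil(depth)+1)) / (1 - BF), the upper bound in Equation (3)."""
    d = math.ceil(depth)
    return n * (1 - bf ** (d + 1)) / (1 - bf)

n, bf, theta, delta = 5, 2, 1 / 5, 0.25
lam = theta * delta                                   # 0.05
avg_len = math.log(delta) / (-math.log(bf)) + 1       # 3.0
print(avg_len, upper_bound_ops(n, bf, avg_len))       # 3.0  75.0

for t_max in (0.6, 0.7, 0.8, 0.9):
    worst_len = math.log(lam * n) / math.log(t_max) + 1
    print(t_max, round(worst_len, 1), int(upper_bound_ops(n, bf, worst_len)))
    # 0.6 3.7 155 | 0.7 4.9 315 | 0.8 7.2 2555 | 0.9 14.2 327675
    # (the T_max = 1 column of Table 1 diverges, so it is skipped here)
```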
4 Concluding Remarks

Several authors have been studying the problem of mining web usage patterns from log data. Patterns inferred from past user navigation behaviour in a site are useful to provide insight on how to improve the web site design's structure and to enable a user to prefetch web pages that he is likely to download. In previous work we have proposed to model users' navigation records, inferred from log data, as a hypertext probabilistic grammar, together with an algorithm to find the higher probability trails, which correspond to the users' preferred web navigation trails. In this paper we present a formal analysis of the algorithm's complexity and show that the algorithm presents on average linear behaviour in the number of web pages accessed. In the literature there are several other web usage mining algorithms; however, comparison is not always possible due to the diversity of the assumptions made in the various models. Providing the average complexity analysis of our algorithm is a step in the direction of making the different web usage mining approaches comparable.

As future work we mention the study of the form of the probability distribution which characterises user navigation behaviour and the effect of such a distribution on the complexity analysis. Also, we aim to explore dynamic Markov chains and state cloning in the context of web usage mining [10]. Finally, we plan to conduct a study to evaluate the usefulness of the web usage patterns to the user and to incorporate relevance measures into the model.
Acknowledgements. The authors would like to thank the referees for several comments and suggestions to improve the paper.

References

[1] Bettina Berendt, Bamshad Mobasher, Myra Spiliopoulou, and Jim Wiltshire. Measuring the accuracy of sessionizers for web usage analysis. In Proceedings of the Web Mining Workshop at the First SIAM International Conference on Data Mining, pages 7-14, Chicago, April 2001.
[2] José Borges. A Data Mining Model to Capture User Web Navigation. PhD thesis, University College London, London University, 2000.
[3] José Borges and Mark Levene. Data mining of user navigation patterns. In Brij Masand and Myra Spiliopoulou, editors, Web Usage Analysis and User Profiling, Lecture Notes in Artificial Intelligence (LNAI 1836), pages 92-111. Springer Verlag, Berlin, 2000.
[4] José Borges and Mark Levene. A fine grained heuristic to capture web navigation patterns. SIGKDD Explorations, 2(1):40-50, 2000.
[5] José Borges and Mark Levene. A heuristic to capture longer user web navigation patterns. In Proceedings of the First International Conference on Electronic Commerce and Web Technologies, pages 155-164, Greenwich, U.K., September 2000.
[6] I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White. Visualization of navigation patterns on a web site using model based clustering. In Proceedings of the 6th KDD Conference, pages 280-284, 2000.
[7] Mukund Deshpande and George Karypis. Selective Markov models for predicting web-page accesses. In Proc. of the 1st SIAM Data Mining Conference, April 2001.
[8] Xing Dongshan and Shen Junyi. A new Markov model for web access prediction. Computing in Science & Engineering, 4(6):34-39, 2002.
[9] John G. Kemeny and J. Laurie Snell. Finite Markov Chains. D. Van Nostrand, Princeton, New Jersey, 1960.
[10] Mark Levene and George Loizou. Computing the entropy of user navigation in the web. International Journal of Information Technology and Decision Making, 2:459-476, 2003.
[11] Bamshad Mobasher, Honghua Dai, Tao Luo, and Miki Nakagawa. Using sequential and non-sequential patterns for predictive web usage mining tasks. In Proceedings of the IEEE International Conference on Data Mining, pages 669-672, Japan, December 2002.
[12] James Pitkow and Peter Pirolli. Mining longest repeating subsequences to predict world wide web surfing. In Proc. of the Second Usenix Symposium on Internet Technologies and Systems, Colorado, USA, October 1999.
[13] R. Sarukkai. Link prediction and path analysis using Markov chains. In Proceedings of the 9th Int. WWW Conference, 2000.
[14] M. Spiliopoulou and C. Pohle. Data mining for measuring and improving the success of web sites. Data Mining and Knowledge Discovery, 5:85-114, 2001.
[15] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2):1-12, January 2000.
[16] Robert Tarjan. Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146-160, June 1972.

Finding the Optimal Expected Location of Target Web Pages Based on Web-Log Mining (基于Web-Log Mining寻找目标网页最优期望定位)
Authors: Cong Rong, Wang Xiukun, Wu Jun, Zhou Yan
Journal: Computer Engineering and Applications (《计算机工程与应用》)
Year (volume), issue: 2004, 40(34)
Abstract: To optimize a web site's access efficiency and bring the site's actual structure into line with the users' browsing behaviour, this paper applies Web mining techniques, taking the site's server Web logs as the data source. The algorithms FEL and CRLL are used to find the expected locations of target pages from the users' access transaction sequences, and a list of recommended links is generated on the principle of minimizing the number of "back" operations. Guided by this list, the web site designer can modify the link relationships between pages and thereby reduce the time needed to reach the target pages.
Pages: 4 (pp. 151-153, 178)
Affiliations: School of Electronics and Information Technology, Dalian University of Technology, Dalian 116024; Educational Technology Centre, Dalian Naval Academy, Dalian 116018
Language: Chinese
Chinese Library Classification: TP393
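The abstract does not spell out the FEL and CRLL algorithms, but the underlying idea, spotting pages from which users had to go "back" before reaching a target page, can be illustrated with a small, purely hypothetical sketch. The session format, the backtrack heuristic and the function names below are assumptions, not a reconstruction of the paper's algorithms.

```python
# Hypothetical sketch: counting where users backtrack before reaching a target page.
# A session is a list of page ids in visit order; a revisit of an earlier page is
# treated as a "back" operation (the paper's FEL/CRLL details are not given here).
from collections import Counter

def backtrack_points(session, target):
    """Pages from which the user backtracked while searching for `target`."""
    points, seen = [], []
    for page in session:
        if page == target:
            break
        if page in seen:              # returning to an earlier page = "back"
            points.append(seen[-1])   # the page the user gave up on
        seen.append(page)
    return points

def expected_locations(sessions, target, top_k=3):
    """Pages where a link to `target` would have saved the most backtracking."""
    counts = Counter()
    for s in sessions:
        counts.update(backtrack_points(s, target))
    return counts.most_common(top_k)

sessions = [["home", "products", "specs", "products", "faq", "manual"],
            ["home", "search", "specs", "search", "manual"]]
print(expected_locations(sessions, target="manual"))  # [('specs', 2)]
```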


Different Aspects of Web Log Mining

Renáta Iváncsy, István Vajk
Department of Automation and Applied Informatics, and HAS-BUTE Control Research Group
Budapest University of Technology and Economics
Goldmann Gy. tér 3, H-1111 Budapest, Hungary
e-mail: {renata.ivancsy, vajk}@aut.bme.hu

Abstract: The expansion of the World Wide Web has resulted in a large amount of data that is now freely available for user access. The data have to be managed and organized in such a way that the user can access them efficiently. For this reason the application of data mining techniques on the Web is now the focus of an increasing number of researchers. One key issue is the investigation of user navigational behavior from different aspects. For this reason different types of data mining techniques can be applied to the log files collected on the servers. In this paper three of the most important approaches are introduced for web log mining. All three methods are based on the frequent pattern mining approach. The three types of patterns that can be used to obtain useful information about the navigational behavior of the users are page set, page sequence and page graph mining.

Keywords: Pattern mining, Sequence mining, Graph mining, Web log mining

1 Introduction

The expansion of the World Wide Web (Web for short) has resulted in a large amount of data that is now in general freely available for user access. The different types of data have to be managed and organized such that they can be accessed by different users efficiently. Therefore, the application of data mining techniques on the Web is now the focus of an increasing number of researchers. Several data mining methods are used to discover the hidden information in the Web. However, Web mining does not only mean applying data mining techniques to the data stored in the Web. The algorithms have to be modified such that they better suit the demands of the Web. New approaches should be used which better fit the properties of Web data. Furthermore, not only data mining algorithms, but also artificial intelligence, information retrieval and natural language processing techniques can be used efficiently. Thus, Web mining has developed into an autonomous research area.

The focus of this paper is to provide an overview of how to use frequent pattern mining techniques for discovering different types of patterns in a Web log database. The three patterns to be searched for are frequent itemsets, sequences and tree patterns. For each of these problems an algorithm was developed in order to discover the patterns efficiently. The frequent itemsets (frequent page sets) are discovered using the ItemsetCode algorithm presented in [1]. The main advantage of the ItemsetCode algorithm is that it discovers the small frequent itemsets very quickly, so the task of discovering the longer ones is enhanced as well. The algorithm that discovers the frequent page sequences is called SM-Tree [2] and the algorithm that discovers the tree-like patterns is called PD-Tree [3]. Both algorithms exploit the benefits of an automata-theoretic approach for discovering the frequent patterns: the SM-Tree algorithm uses state machines for discovering the sequences, and the PD-Tree algorithm uses pushdown automata for determining the support of the tree patterns in a tree database.

The organization of the paper is as follows. Section 2 introduces the basic tasks of Web mining. In Section 3 Web usage mining is described in detail.
The different tasks in the process of Web usage mining are depicted as well. Related work can be found in Section 4 and the preprocessing steps are described in Section 5. The results of the mining process can be found in Section 6.

2 Web Mining Approaches

Web mining involves a wide range of applications that aim at discovering and extracting hidden information in data stored on the Web. Another important purpose of Web mining is to provide a mechanism to make data access more efficient and adequate. A third interesting approach is to discover the information which can be derived from the activities of users, which are stored in log files, for example for predictive Web caching [4]. Thus, Web mining can be categorized into three different classes based on which part of the Web is to be mined [5,6,7]. These three categories are (i) Web content mining, (ii) Web structure mining and (iii) Web usage mining. For detailed surveys of Web mining please refer to [5,6,8,9].

Web content mining [10,9] is the task of discovering useful information available on-line. There are different kinds of Web content which can provide useful information to users, for example multimedia data, structured (i.e. XML documents), semi-structured (i.e. HTML documents) and unstructured data (i.e. plain text). The aim of Web content mining is to provide an efficient mechanism to help the users find the information they seek. Web content mining includes the tasks of organizing and clustering the documents and providing search engines for accessing the different documents by keywords, categories, contents, etc.

Web structure mining [11,12,13,14] is the process of discovering the structure of hyperlinks within the Web. Practically, while Web content mining focuses on the inner-document information, Web structure mining discovers the link structures at the inter-document level. The aim is to identify the authoritative and the hub pages for a given subject. Authoritative pages contain useful information and are supported by several links pointing to them, which means that these pages are highly referenced. A page having a lot of referencing hyperlinks means that the content of the page is useful, preferable and maybe reliable. Hubs are Web pages containing many links to authoritative pages, thus they help in clustering the authorities. Web structure mining can be carried out within a single portal or on the whole Web. Mining the structure of the Web supports the task of Web content mining: using the information about the structure of the Web, document retrieval can be made more efficient, and the reliability and relevance of the found documents can be greater. The graph structure of the Web can be exploited by Web structure mining in order to improve the performance of information retrieval and to improve the classification of the documents.

Web usage mining is the task of discovering the activities of the users while they are browsing and navigating through the Web. The aim of understanding the navigation preferences of the visitors is to enhance the quality of electronic commerce services (e-commerce), to personalize Web portals [15] or to improve the Web structure and Web server performance [16]. In this case, the mined data are the log files, which can be seen as secondary data on the Web, whereas the documents accessible through the Web are understood as primary data. There are three types of log files that can be used for Web usage mining.
Log files are stored on the server side, on the client side and on proxy servers. Having more than one place for storing the information about the navigation patterns of the users makes the mining process more difficult. Really reliable results could be obtained only if one had data from all three types of log file, because the server side does not contain records of those Web page accesses that are cached on the proxy servers or on the client side. Besides the log file on the server, the one on the proxy server provides additional information; however, the page requests stored on the client side are still missing, and it is problematic to collect all the information from the client side. Thus, most of the algorithms work based only on the server-side data. Some commonly used data mining algorithms for Web usage mining are association rule mining, sequence mining and clustering [17].

3 Web Usage Mining

Web usage mining, from the data mining aspect, is the task of applying data mining techniques to discover usage patterns from Web data in order to understand and better serve the needs of users navigating on the Web [18]. As with every data mining task, the process of Web usage mining consists of three main steps: (i) preprocessing, (ii) pattern discovery and (iii) pattern analysis.

In this work pattern discovery means applying the introduced frequent pattern discovery methods to the log data. For this reason the data have to be converted in the preprocessing phase such that the output of the conversion can be used as the input of the algorithms. Pattern analysis means understanding the results obtained by the algorithms and drawing conclusions.

Figure 1 shows the process of Web usage mining realized as a case study in this work. As can be seen, the input of the process is the log data. The data have to be preprocessed in order to obtain the appropriate input for the mining algorithms. The different methods need different input formats, thus the preprocessing phase can provide three types of output data.

The frequent pattern discovery phase needs only the Web pages visited by a given user. In this case the order of the pages is irrelevant. Also the duplicates of the same pages are omitted, and the pages are sorted in a predefined order.

In the case of sequence mining, however, the original ordering of the pages is also important, and if a page was visited more than once by a given user in a user-defined time interval, then that is relevant as well. For this reason the preprocessing module of the whole system provides the sequences of Web pages by users or user sessions.

For subtree mining not only the sequences are needed but also the structure of the web pages visited by a given user. In this case the backward navigations are omitted; only the forward navigations are relevant, which form a tree for each user.

After the discovery has been achieved, the analysis of the patterns follows. The whole mining process is an iterative task, which is indicated by the feedback in Figure 1. Depending on the results of the analysis, either the parameters of the preprocessing step can be tuned (i.e. by choosing another time interval to determine the sessions of the users) or only the parameters of the mining algorithms (in this case, the minimum support threshold).

In the case study presented in this work the aim of Web usage mining is to discover the frequent pages visited at the same time, and to discover the page sequences visited by users.
The results obtained by the application can be used to shape the structure of a portal for advertising purposes and to provide a more personalized Web portal.

Figure 1: Process of Web usage mining

4 Related Work

In Web usage mining several data mining techniques can be used. Association rules are used in order to discover the pages which are visited together even if they are not directly connected, which can reveal associations between groups of users with specific interests [15]. This information can be used, for example, for restructuring Web sites by adding links between those pages which are visited together. Association rules in Web logs are discovered in [19,20,21,22,23]. Sequence mining can be used to discover the Web pages which are accessed immediately after one another. Using this knowledge the trends of user activity can be determined and predictions of the next visited pages can be calculated. Sequence mining is accomplished in [16], where a so-called WAP-tree is used for storing the patterns efficiently. Tree-like topology patterns and frequent path traversals are searched for by [19,24,25,26].

Web usage mining is elaborated in many aspects. Besides applying data mining techniques, other approaches are also used for discovering information. For example, [7] uses a probabilistic grammar-based approach, namely an Ngram model, for capturing user navigation behavior patterns. The Ngram model assumes that the last N pages browsed affect the probability of identifying the next page to be visited. [27] uses Probabilistic Latent Semantic Analysis (PLSA) to discover the navigation patterns. Using PLSA the hidden semantic relationships among users and between users and Web pages can be detected. In [28] Markov assumptions are used as the basis to mine the structure of browsing patterns. For Web prefetching, [29] uses Web log mining techniques and [30] uses a Markov predictor.

5 Data Preprocessing

The data in the log files of the server about the actions of the users cannot be used for mining purposes in the form in which it is stored. For this reason a preprocessing step must be performed before the pattern discovery phase.

The preprocessing step contains three separate phases. Firstly, the collected data must be cleaned, which means that graphic and multimedia entries are removed. Secondly, the different sessions belonging to different users should be identified. A session is understood as a group of activities performed by a user while navigating through a given site. Identifying the sessions from the raw data is a complex step, because the server logs do not always contain all the information needed. There are Web server logs that do not contain enough information to reconstruct the user sessions; in this case, for example, time-oriented heuristics can be used as described in [31]. After identifying the sessions, the Web page sequences are generated, which task belongs to the first step of the preprocessing. The third step is to convert the data into the format needed by the mining algorithms. If the sessions and the sequences are identified, this step can be accomplished more easily.
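To make the session-identification step concrete, the snippet below shows one common time-oriented heuristic: requests from the same user are split into a new session whenever the gap between consecutive requests exceeds a timeout. This is an illustrative sketch only; the 30-minute threshold and the (user, timestamp, url) record format are assumptions, not details taken from the paper or from [31].

```python
# Sketch of a time-oriented sessionization heuristic (assumed record format).
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; a commonly used default, chosen here arbitrarily

def sessionize(records):
    """records: iterable of (user_id, unix_timestamp, url), not necessarily sorted.
    Returns a list of sessions, each session being the list of urls in visit order."""
    by_user = defaultdict(list)
    for user, ts, url in records:
        by_user[user].append((ts, url))

    sessions = []
    for user, visits in by_user.items():
        visits.sort()                                # order each user's requests by time
        current, last_ts = [], None
        for ts, url in visits:
            if last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
                sessions.append(current)             # gap too large: close the session
                current = []
            current.append(url)
            last_ts = ts
        if current:
            sessions.append(current)
    return sessions

records = [("u1", 0, "/"), ("u1", 120, "/news"), ("u1", 4000, "/tech"), ("u2", 50, "/")]
print(sessionize(records))   # [['/', '/news'], ['/tech'], ['/']]
```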
In our experiments we used two web server log files: the first one was the anonymous msnbc.com data¹ and the second one was a Click Stream data set downloaded from the ECML/PKDD 2005 Discovery Challenge². The two log files are in different formats, thus different preprocessing steps were needed.

¹ /databases/msnbc/msnbc.html
² http://lisp.vse.cz/challenge/CURRENT/

The msnbc log data describes the page visits of users who visited msnbc.com on September 28, 1999. Visits are recorded at the level of URL category and are recorded in time order. This means that in this case the first phase of the preprocessing step can be omitted. The data comes from Internet Information Server (IIS) logs for msnbc.com. Each row in the dataset corresponds to the page visits of a user within a twenty-four hour period. Each item of a row corresponds to a request of a user for a page. The pages are coded as shown in Table 1. The client-side cached data is not recorded, thus this data contains only the server-side log.

Table 1: Codes for the page categories

category    code   category   code   category    code
frontpage   1      misc       7      summary     13
news        2      weather    8      bbs         14
tech        3      health     9      travel      15
local       4      living     10     msn-news    16
opinion     5      business   11     msn-sport   17
on-air      6      sports     12

In the case of the msnbc data only the rows have to be converted into itemsets, sequences and trees; the other preprocessing steps are done already. A row is converted into an itemset by omitting the duplicates of the pages and sorting them by their codes. In this way the ItemsetCode algorithm can be executed easily on the dataset.

In order to obtain sequence patterns, the rows have to be converted such that they represent sequences. A row corresponds practically to a sequence having only one item in each itemset. Thus converting a row into the sequence format needed by the SM-Tree algorithm means inserting a -1 between each code.

In order to be able to mine tree-like patterns, the database has to be converted such that the transactions represent trees. For this reason each row is processed in the following way. The root of the tree is the first item of the row. From the subsequent items a branch is created until an item is reached which was already inserted into the tree. In this case the algorithm inserts as many -1 items into the string representation of the tree as the number of items between the new item and the previous occurrence of the same item. The further items form another branch in the tree. For example, given the row "1 2 3 4 2 5", the tree representation of the row is the following: "1 2 3 4 -1 -1 5".
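The three conversions described above are simple enough to express in a few lines. The following sketch is illustrative only; the -1 encodings follow the description in the text, while the function names, the input format and the simplified backtracking rule are assumptions rather than the authors' implementation.

```python
# Sketch: converting one msnbc-style row of page codes into the three input formats.

def to_itemset(row):
    """Itemset: duplicates removed, codes sorted (input format for ItemsetCode)."""
    return sorted(set(row))

def to_sequence(row):
    """Sequence: original order kept, -1 inserted between the codes (input for SM-Tree)."""
    seq = []
    for code in row:
        seq.extend([code, -1])
    return seq[:-1]

def to_tree_string(row):
    """Tree: forward navigations form branches, -1 marks moving back up one level
    (string encoding used as input for PD-Tree). Handles the common case where the
    revisited page lies on the current path; other cases would need extra care."""
    tree, stack = [], []
    for code in row:
        if code in stack:
            # back up to the previous occurrence of this page
            while stack and stack[-1] != code:
                stack.pop()
                tree.append(-1)
        else:
            tree.append(code)
            stack.append(code)
    return tree

row = [1, 2, 3, 4, 2, 5]
print(to_itemset(row))      # [1, 2, 3, 4, 5]
print(to_sequence(row))     # [1, -1, 2, -1, 3, -1, 4, -1, 2, -1, 5]
print(to_tree_string(row))  # [1, 2, 3, 4, -1, -1, 5], matching the "1 2 3 4 -1 -1 5" example
```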
For this reason, namely, that the interaction of the users is needed in this phase of the mining process, it is advisable executing the frequent pattern discovery algorithm iteratively on a relatively small part of the whole dataset only. Choosing the right size of the sample data, the response time of the application remains small, while the sample data represents the whole data accurately. Setting the minimum support threshold parameter is not a trivial task, and it requires a lot of practice and attention on the part of the user.The frequent itemset discovery and the association rule mining was accomplished using the ItemsetCode algorithm with different minimum support and minimum confidence threshold values. Figure 3 (a) depicts the association rules generated from data at a minimum support threshold of 0.1% and at a minimumconfidence threshold of 85% (which is depicted in the figure). Analyzing theresults, one can make the advertising process more successful and the structure ofthe portal can be changed such that the pages contained by the rules are accessiblefrom each other.Another type of decision can be made based on the information gained from asequence mining algorithm. Figure 3 (b) shows a part of the discovered sequencesof the SM-Tree algorithm from the data. The percentage valuesdepicted in Figure 3 (b) are the support of the sequences.The frequent tree mining task was accomplished using the PD-Tree algorithm. Apart of the result of the tree mining algorithm is depicted in Figure 4 (a). Thepatterns contain beside the trees (represented in string format), also the supportvalues. The graphical representations of the patterns are depicted in Figure 4 (b)without the support values.(a) (b)Figure 3(a) Association rules and (b) sequential rules based on the msnbc.data(a) (b)Figure 4Frequent tree patterns based on msnbc.data in (a) string and (b) graphical representationConclusionsThis paper deals with the problem of discovering hidden information from largeamount of Web log data collected by web servers. The contribution of the paper isto introduce the process of web log mining, and to show how frequent patterndiscovery tasks can be applied on the web log data in order to obtain usefulinformation about the user’s navigation behavior.AcknowledgementThis work has been supported by the Mobile Innovation Center, Hungary, by thefund of the Hungarian Academy of Sciences for control research and theHungarian National Research Fund (grant number: T042741).References[1] R. Iváncsy and I. Vajk, “Time- and Memory-Efficient Frequent ItemsetDiscovering Algorithm for Association Rule Mining.” InternationalJournal of Computer Applications in Techology, Special Issue on DataMining Applications (in press)[2] R. Iváncsy and I. Vajk, “Efficient Sequential Pattern Mining Algorithms.”WSEAS Transactions on Computers, vol. 4, num. 2, 2005, pp. 96-101[3] R. Iváncsy and I. Vajk, “PD-Tree: A New Approach to SubtreeDiscovery.”, WSEAS TRansactions on Information Science andApplications, vol. 2, num. 11, 2005, pp. 1772-1779[4] Q. Yang and H. H. Zhang, “Web-log mining for predictive web caching.”IEEE Trans. Knowl. Data Eng., vol. 15, no. 4, pp. 1050-1053, 2003[5] Kosala and Blockeel, “Web mining research: A survey,” SIGKDD:SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) onKnowledge Discovery and Data Mining, ACM, vol. 2, 2000[6] S. K. Madria, S. S. Bhowmick, W. K. Ng, and E.-P. Lim, “Research issuesin web data mining,” in Data Warehousing and Knowledge Discovery,1999, pp. 303-312[7] J. 
[8] M. N. Garofalakis, R. Rastogi, S. Seshadri, and K. Shim, "Data mining and the web: Past, present and future," in ACM CIKM'99 2nd Workshop on Web Information and Data Management (WIDM'99), Kansas City, Missouri, USA, November 5-6, 1999, C. Shahabi, Ed. ACM, 1999, pp. 43-47
[9] S. Chakrabarti, "Data mining for hypertext: A tutorial survey," SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery and Data Mining, ACM, vol. 1, no. 2, pp. 1-11, 2000
[10] M. Balabanovic and Y. Shoham, "Learning information retrieval agents: Experiments with automated web browsing," in Proceedings of the AAAI Spring Symposium on Information Gathering from Heterogenous, Distributed Resources, 1995, pp. 13-18
[11] S. Chakrabarti, B. Dom, and P. Indyk, "Enhanced hypertext categorization using hyperlinks," in SIGMOD '98: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM Press, 1998, pp. 307-318
[12] J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. S. Tomkins, "The Web as a graph: Measurements, models and methods," Lecture Notes in Computer Science, vol. 1627, pp. 1-18, 1999
[13] J. Hou and Y. Zhang, "Effectively finding relevant web pages from linkage information," IEEE Trans. Knowl. Data Eng., vol. 15, no. 4, pp. 940-951, 2003
[14] H. Han and R. Elmasri, "Learning rules for conceptual structure on the web," J. Intell. Inf. Syst., vol. 22, no. 3, pp. 237-256, 2004
[15] M. Eirinaki and M. Vazirgiannis, "Web mining for web personalization," ACM Trans. Inter. Tech., vol. 3, no. 1, pp. 1-27, 2003
[16] J. Pei, J. Han, B. Mortazavi-Asl, and H. Zhu, "Mining access patterns efficiently from web logs," in PADKK '00: Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Current Issues and New Applications. London, UK: Springer-Verlag, 2000, pp. 396-407
[17] R. Cooley, B. Mobasher, and J. Srivastava, "Data preparation for mining world wide web browsing patterns," Knowledge and Information Systems, vol. 1, no. 1, pp. 5-32, 1999
[18] J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan, "Web usage mining: Discovery and applications of usage patterns from web data," SIGKDD Explorations, vol. 1, no. 2, pp. 12-23, 2000
[19] M. S. Chen, J. S. Park, and P. S. Yu, "Data mining for path traversal patterns in a web environment," in Sixteenth International Conference on Distributed Computing Systems, 1996, pp. 385-392
[20] J. Punin, M. Krishnamoorthy, and M. Zaki, "Web usage mining: Languages and algorithms," in Studies in Classification, Data Analysis, and Knowledge Organization. Springer-Verlag, 2001
[21] P. Batista and M. J. Silva, "Mining web access logs of an on-line newspaper," 2002
[22] O. R. Zaiane, M. Xin, and J. Han, "Discovering web access patterns and trends by applying olap and data mining technology on web logs," in ADL '98: Proceedings of the Advances in Digital Libraries Conference. Washington, DC, USA: IEEE Computer Society, 1998, pp. 1-19
[23] J. F. F. M. V. M. Li Shen, Ling Cheng and T. Steinberg, "Mining the most interesting web access associations," in WebNet 2000 - World Conference on the WWW and Internet, 2000, pp. 489-494
[24] X. Lin, C. Liu, Y. Zhang, and X. Zhou, "Efficiently computing frequent tree-like topology patterns in a web environment," in TOOLS '99: Proceedings of the 31st International Conference on Technology of Object-Oriented Language and Systems. Washington, DC, USA: IEEE Computer Society, 1999, p. 440
[25] A. Nanopoulos and Y. Manolopoulos, "Finding generalized path patterns for web log data mining," in ADBIS-DASFAA '00: Proceedings of the East-European Conference on Advances in Databases and Information Systems Held Jointly with International Conference on Database Systems for Advanced Applications. London, UK: Springer-Verlag, 2000, pp. 215-228
[26] A. Nanopoulos and Y. Manolopoulos, "Mining patterns from graph traversals," Data and Knowledge Engineering, vol. 37, no. 3, pp. 243-266, 2001
[27] X. Jin, Y. Zhou, and B. Mobasher, "Web usage mining based on probabilistic latent semantic analysis," in KDD '04: Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM Press, 2004, pp. 197-205
[28] S. Jespersen, T. B. Pedersen, and J. Thorhauge, "Evaluating the markov assumption for web usage mining," in WIDM '03: Proceedings of the 5th ACM International Workshop on Web Information and Data Management. New York, NY, USA: ACM Press, 2003, pp. 82-89
[29] A. Nanopoulos, D. Katsaros, and Y. Manolopoulos, "Exploiting web log mining for web cache enhancement," in WEBKDD '01: Revised Papers from the Third International Workshop on Mining Web Log Data Across All Customers Touch Points. London, UK: Springer-Verlag, 2002, pp. 68-87
[30] A. Nanopoulos, D. Katsaros, and Y. Manolopoulos, "A data mining algorithm for generalized web prefetching," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 5, pp. 1155-1169, 2003
[31] J. Zhang and A. A. Ghorbani, "The reconstruction of user sessions from a server log using improved time-oriented heuristics," in CNSR, IEEE Computer Society, 2004, pp. 315-322
