WEB RECOMMENDATION SYSTEM BASED ON A MARKOV-CHAIN MODEL
基于Hadoop的新闻推荐算法研究
基于Hadoop的新闻推荐算法研究发布时间:2023-02-01T05:32:28.525Z 来源:《科学与技术》2022年第16期8月作者:尹铁源张思淇[导读] 随着线上阅读新闻方式的兴起,传统的新闻推荐算法存在着特征稀疏、缺少多样性等问题。
为解决以上问题,本文提出一种基于Hadoop的融合兴趣模型推荐算法。
尹铁源张思淇沈阳工业大学信息科学与工程学院辽宁沈阳 110870摘要:随着线上阅读新闻方式的兴起,传统的新闻推荐算法存在着特征稀疏、缺少多样性等问题。
为解决以上问题,本文提出一种基于Hadoop的融合兴趣模型推荐算法。
首先,考虑特征稀疏问题,将特征词扩展得到兴趣扩展模型,其次,考虑新闻热度和阅读时长对相似度的影响,提出了改进的相似度计算方法,得到用户潜在兴趣扩展模型,最后,将两个模型进行混合得到融合兴趣模型,进行新闻推荐。
实验结果表明,在hadoop中运行改进后的算法,推荐效果有所提升。
关键词:新闻推荐;Hadoop;基于内容的推荐Research on Hadoop-based news recommendation algorithmYIN Tie-yuan, ZHANG Si-qi(School of information science and engineering, Shenyang University of technology, Shenyang, Liaoning 110870)Absrtact: With the rise of online news reading, traditional news recommendation algorithms have some problems, such as sparse features and lack of diversity. To solve the above problems, this paper proposes a Hadoop based fusion interest model recommendation algorithm. Firstly, considering the problem of feature sparsity, the feature words are extended to obtain the interest expansion model. Secondly, considering the impact of news popularity and reading time on the similarity, an improved similarity calculation method is proposed to obtain the user potential interest expansion model. Finally, the two models are mixed to obtain the fusion interest model for news recommendation. The experimental results show that the performance of the improved algorithm in Hadoop is improved.Key words: news recommendation; Hadoop; Content based recommendations1引言随着互联网的崛起式发展,更多的人偏爱于网上阅读新闻报道,但由于网络上新闻报道的数量成千上万,使得用户在海量新闻中陷入迷茫,这就产生了“信息过载”的问题[1]。
写一篇希望喜欢的网站改进建议的英语作文
写一篇希望喜欢的网站改进建议的英语作文全文共10篇示例,供读者参考篇1Dear website,I am a primary school student and I really like visiting your website. It's super cool! But I have some ideas on how you can make it even better. Here are my suggestions:Firstly, I think you could add more fun games and activities for kids to play. Maybe some puzzles or quizzes that are related to the topics on your website. That way, we can learn while having fun.Secondly, I noticed that some of the articles are a bit long and difficult to understand. It would be helpful if you could break them down into smaller sections and use simpler language. This way, it will be easier for us to follow along and learn new things.Thirdly, I think it would be great if you could have a section where we can share our own ideas and creations. It could be like a mini blog for kids to write about their interests and hobbies.This way, we can connect with other kids who have similar interests.Overall, I really enjoy visiting your website and I hope you will consider my suggestions to make it even better. Thank you for providing such a fun and educational platform for kids like me to explore!Yours sincerely,[Your name]篇2Hello everyone! Today I want to share with you some suggestions on how to improve your favorite website. I hope you enjoy it!First of all, I think the website could use some more interactive features. For example, it would be cool to have a chat room where users can talk to each other in real-time. It would also be fun to have some games or quizzes that users can play while they are on the site.Secondly, I believe that the website could benefit from a better search function. Sometimes it can be hard to find whatyou are looking for, so having a more advanced search tool would make it easier for users to navigate the site.Another suggestion is to add more personalized content. It would be nice if the website could recommend articles or videos based on the user's interests and browsing history. This way, users would feel more connected to the site and would be more likely to visit it regularly.Lastly, I think it would be great if the website could incorporate more multimedia elements. For example, adding videos or podcasts to complement the written content would make the site more engaging and appealing to users.Overall, I think these suggestions would make the website even more enjoyable for users. I hope the website developers will consider implementing some of these ideas in the future. Thank you for listening!篇3Hey everyone! Today I want to talk about my favorite website and some suggestions I have to make it even better. My favorite website is YouTube because I love watching videos of my favorite YouTubers, learning new things, and listening to music.First, I think YouTube should have a feature where you can create playlists with your favorite videos. This way, you can easily find and watch all your favorite videos in one place. It would be so cool to have different playlists for different moods or topics!Second, I think YouTube should have a "watch later" button so you can save videos to watch when you have more time. Sometimes I see a video I want to watch, but I don't have time right then. It would be so helpful to have a way to save it for later.Third, I think YouTube should improve its recommendation system. Sometimes, the videos it suggests for me aren't things I'm interested in. It would be great if YouTube could recommend more videos that I would actually enjoy based on what I watch.Overall, I love YouTube and I think these improvements would make it even better. What do you think? Do you have any suggestions for your favorite website? Let me know in the comments!篇4Dear website,I really like your website because it has so many cool games and videos to watch. But I think there are some things that you could do to make it even better!One thing you could do is add more games for different age groups. Some of the games on your website are a little too hard for me and my friends. It would be awesome if you could make some games that are easier for kids like us to play.Another thing you could do is add a chat feature so that we can talk to our friends while we are playing games. It would be so much fun to be able to chat and play at the same time!I also think it would be cool if you could have a section where we can write reviews of the games and videos on your website. That way, we can tell other kids what we like about them and help them decide what to play.Overall, I really like your website and I hope you will consider my suggestions to make it even better. Thank you for reading this letter!Sincerely,[Your name]篇5Hey guys, have you ever thought about how cool it would be if our favorite website could be even better? Well, I have some ideas for making our favorite website even more awesome!First of all, I think it would be super helpful if the website had a search bar at the top of the page. That way, if we want to find something specific, we can just type it in and it will show up right away. It would save us a lot of time scrolling through all the different pages.Another suggestion I have is to add a section where we can write reviews for the things we like on the website. For example, if we read a really interesting article or watch a funny video, we could leave a comment about how much we liked it. It would be fun to see what other people think too!Lastly, I think it would be really cool if the website had a chat feature so we could talk to our friends while we are on the site. We could share our thoughts about the latest news or gossip about our favorite celebrities. It would make the website feel more like a community.I hope the people who run the website will consider my suggestions because I know it would make me and my friends really happy. Thanks for listening to my ideas!篇6Hey guys, today I want to talk about my favorite website and some suggestions for improvement. The website I really like is YouTube. I love watching videos of cute animals, funny skits, and cool science experiments. But sometimes there are things that I wish YouTube could do better.First of all, I think YouTube should have a better system for recommending videos. Sometimes I watch one video about cats, and then my whole recommended list is just cat videos. It would be nice if the recommendations were more diverse and showed me videos on different topics that I might be interested in.Secondly, I think YouTube should have more ways for me to interact with my favorite creators. I would love to be able to send them messages or even ask them questions during live streams. It would make me feel more connected to the people that I watch and admire.Lastly, I think YouTube should do a better job of monitoring and removing inappropriate content. Sometimes I stumble upon videos that are not suitable for kids like me, and it can be really scary or upsetting. I think YouTube should make sure that all the videos on their site are safe and appropriate for everyone.Overall, I really love YouTube and spend a lot of time on it. I just think that with a few small changes, it could be even better. Thanks for listening to my suggestions!篇7Hello everyone! Today I want to talk about my favorite website and give some suggestions for improvements. The website I really like is YouTube because I can watch funny videos, learn new things, and listen to music.First of all, I think YouTube should have more videos for kids like me. Sometimes it's hard to find videos that are appropriate for my age. It would be great if they have a special section just for kids with fun and educational videos.Secondly, I think YouTube should improve the quality of their videos. Sometimes the videos are blurry or take a long time to load. It would be nice if they could make the videos clearer and faster to watch.Also, I wish YouTube could have more interactive features. It would be fun if I could play games or quizzes while watching videos. This way, I can learn and have fun at the same time.Lastly, I think YouTube should have better parental controls. Sometimes I come across videos that are not suitable for kids. It would be good if parents can have more options to filter out inappropriate content.In conclusion, I really love YouTube and I hope they can consider my suggestions to make the website even better for kids like me. Thank you for listening!篇8Hello everyone! Today I want to talk about a website that I really like and some suggestions for making it even better.The website I’m talking about is called . It’s a super fun website where you can play all kinds of awesome games for free. I love playing games on because there are so many different ones to choose from. Whether you like puzzle games, action games, or even dress up games, you can find something you enjoy on this website.One suggestion I have for is to add more new games more often. I love trying out new games and it would be really cool if there were new ones to play every week. It would keep things fresh and exciting for all the players.Another suggestion I have is to make the website easier to navigate. Sometimes it’s a little tricky to find the games I want to play because there are so many of them. It would be awesome if there were categories or filters to help me quickly find the games I’m interested in.Lastly, I think it would be great if had a chat feature so that players could talk to each other while they’re playing. It would be fun to make new friends and chat about the games we’re playing together.Overall, I really love CoolGa and I think it’s already a fantastic website. But I believe these suggestions could make it even better. I hope the people who run the website will consider these ideas.Thanks for reading my suggestions! Have a great day and keep on gaming!篇9Title: My Suggestions for Improving the Website I LoveHi everyone! I want to talk about my favorite website today. It's so awesome, but I think it can be even better with a few changes. Here are my suggestions for improving the website:First of all, I think the website should have more interactive features. Like maybe a chat room where users can talk to each other and make new friends. It would be super fun to chat with other people who love the website as much as I do.Secondly, I think the website should have a section where users can share their own stories and experiences. It would be so cool to read about how the website has helped or inspired other people. Plus, it would make the website feel more personal and welcoming.Another idea I have is to add more games and activities to the website. I love playing games online, and I think it would be really fun to have some games on the website that are related to the website's theme. It would make the website even more entertaining and engaging.Lastly, I think the website could use a makeover in terms of design. Maybe some new colors or graphics would make the website more visually appealing. A fresh new look could attract more users and keep current users coming back for more.Overall, I really love this website and I think these improvements could make it even better. I hope the website owners will consider my suggestions. Thank you for reading!篇10Title: My Suggestions for Improving My Favorite WebsiteHi everyone! Today I want to talk about my favorite website and some ideas to make it even better. My favorite website is ABC Kids, a website full of fun games, cartoons and educational activities. I love spending time on it, but I think there are some ways it could be improved.First of all, I think the website could add more new games and activities. While I love the games that are already on there, it would be great to have some new ones to keep things fresh and exciting. It would also be cool to be able to customize your own profile on the website with your favorite characters and colors.Another idea is to have more interactive features on the website. For example, it would be fun to be able to chat with other kids who are on the website at the same time. We could share tips and tricks for the games, or just chat about our favorite cartoons.I also think it would be nice to have a section on the website where kids can submit their own artwork, stories or videos. It would be so cool to see what other kids have created, and it would be a great way to showcase our talents.Overall, I think ABC Kids is already a great website, but with a few tweaks and improvements, it could be even better. I hope the website developers will take my suggestions into consideration. Thanks for listening!。
web安全试题及答案整理涵盖考点考题
Part 1.Explanation of Terms, 30 pointsNOTE: Give the definitions or explanations of the following terms, 5 points for each.(1)Data IntegrityAssures that information and programs are changed only in a specified and authorized manner.In information security, integrity means that data cannot be modified undetectably.Integrity is violated when a message is actively modified in transit. Information security systems typically provide messageintegrity in addition to data confidentiality(2)Information Security AuditAn information security audit is an audit on the level of information security in an organization(3)PKIPKI provides well-conceived infrastructures to deliver security services in an efficient and unified style. PKI is a long-term solutionthat can be used to provide a large spectrum of security protection.(4)X.509In cryptography, X.509 is an ITU-T standard for a public key infrastructure (PKI) for single sign-on (SSO,单点登录)and Privilege Management Infrastructure (PMI,特权管理基础架构).The ITU-T recommendation X.509 defines a directory service that maintains a database of information about users for theprovision of authentication services…(5)Denial-of-Service AttackDoS (Denial of Service) is an attempt by attackers to make a computer resource unavailable to its intended users.(6)SOA(Service-Oriented Architecture)SOA is a flexible set of design principles used during the phases of systems development and integration in computing. A systembased on a SOA will package functionality as a suite of interoperable services that can be used within multiple, separate systemsfrom several business domains.(7)Access ControlAccess control is a system that enables an authority to control access to areas and resources in a given physical facility orcomputer - based information system. An access control system, within the field of physical security, is generally seen as the second layer in the security of a physical structure.(Access control refers to exerting control over who can interact with a resource. Often but not always, this involves an authority,who does the controlling. The resource can be a given building, group of buildings, or computer-based information system. But itcan also refer to a restroom stall where access is controlled by using a coin to open the door)(8)Salted ValueIn cryptography, a salt consists of random bits, creating one of the inputs to a one-way function. The other input is usually apassword or passphrase. The output of the one-way function can be stored (alongside the salt) rather than the password, and stillbe used for authenticating users. The one-way function typically uses a cryptographic hash function. A salt can also be combinedwith a password by a key derivation function such as PBKDF2 to- generate a key for use with a cipher or other cryptographicalgorithm. The benefit provided by using a salted password is making a lookup table assisted dictionary attack against the storedvalues impractical, provided the salt is large enough. That is, an attacker would not be able to create a precomputed lookup table(i.e. a rainbow table) of hashed values (password i salt), because it would require a large computation for each salt.(9)SOAPSOAP is a protocol specification for exchanging structured information in the implementation of Web Services in computernetworks. It relies on Extensible Markup Language (XML) for its message format, and usually relies on other Application Layerprotocols, most notably Hypertext Transfer Protocol (HTTP) and Simple Mail Transfer Protocol (SMTP), for message negotiationand transmission(10)ConfidentialityConfidentiality is the term used to prevent the disclosure of information to unauthorized individuals or systems. Confidentiality isnecessary (but not sufficient) for maintaining the privacy of the people whose personal information a system holds.(ll)AuthenticationIn computing, e-Business and information security is necessary to ensure that the data, transactions, communications ordocuments (electronic or physical) are genuine. It is also important for authenticity to validate that both parties involved are whothey claim they are.(12)KerberosKerberos is an authentication service developed at MIT which allows a distributed system to be able to authenticate requests forservice generated from workstations.Kerberos (ITU-T) is a computer network authentication protocol which works on the basis of “tickets” to allow nodescommunicating over a non-secure network to prove their identity to one another in a secure manner.(13)SSL/TLSSSL are cryptographic protocols that provide communication security over the Internet. TLS and its predecessor, SSL encrypt thesegments of network connections above the Transport Layer, using asymmetric cryptography for key exchange, symmetricencryption for privacy, and message authentication codes for message integrity.(14)Man-in-the-Middle AttackMan-in-the-Middle Attack is a form of active eavesdropping in which the attacker makes independent connections with the victimsand relays messages between them, making them believe that they are talking directly to each other over a private connection,when in fact the entire conversation is controlled by the attacker.(15)System VulnerabilityA vulnerability is a flaw or weakness in a system,s design, implementation, or operation and management that could be exploitedto violate the system,s security policy (which allows an attacker to reduce a system's information assurance). Vulnerability is theintersection of three elements: a system susceptibility or flaw, attacker access to the flaw, and attacker capability to exploit theflaw.(16)Non-RepudiationNon-repudiation refers to a state of affairs where the author of a statement will not be able to successfully challenge theauthorship of the statement or validity of an associated contract.The term is often seen in a legal setting wherein the authenticity of a signature is being challenged. In such an instance, theauthenticity is being "repudiated".(17)Bastion HostA bastion host is a computers on a network, specifically designed and configured to withstand attacks. It,s identified by the firewalladmin as a critical strong point in the network,s security. The firewalls (application - level or circuit - level gateways) and routers can be considered bastion hosts. Other types of bastion hosts include web, mail, DNS, and FTP servers.(18)CSRFCross-Site Request Forgery (CSRF) is an attack that forces an end user to execute unwanted actions on a web application in whichthey're currently authenticated. CSRF attacks specifically target state-changing requests, not theft of data, since the attacker hasno way to see the response to the forged request. With a little help of social engineering (such as sending a link via email or chat),an attacker may trick the users of a web application into executing actions of the attacker's choosing. If the victim is a normaluser, a successful CSRF attack can force the user to perform state changing requests like transferring funds, changing their emailaddress, and so forth. If the victim is an administrative account, CSRF can compromise the entire web application.Part 2.Brief Questions, 40 pointsNOTE: Answer the following HOW TO questions in brief, 8 points for each.(1)Asymmetric Cryptographic Method.非对称加密算法需要两个密钥:公开密钥(public - key)和私有密钥(private - key)。
旅游管理系统参考文献英文
旅游管理系统参考文献英文本文为旅游管理系统参考文献英文,包括以下内容:1. Ali, A., & Al-Sabaan, A. (2018). An efficient recommendation system for tourism management using social network analysis. Computers & Industrial Engineering, 125, 89-99.2. Chang, H. H., & Chen, S. W. (2015). The effects of online travel reviews on consumer behavior: A perspective of social influence. Tourism Management, 47, 46-54.3. Chen, C. C., & Tsai, D. C. (2016). An intelligent recommendation system for tourism planning. Journal of Hospitality and Tourism Technology, 7(1), 2-14.4. Huang, Y., Li, X., & Li, J. (2017). A tourism recommendation system based on user preferences and behavior analysis. Journal of Tourism and Hospitality Management, 5(1), 26-37.5. Li, X., Li, J., & Huang, Y. (2018). A personalized recommendation system for tourism based on big data analytics. Journal of Travel Research, 57(8), 1091-1105.6. Lin, H. F., & Chen, Y. C. (2016). A decision support system for tourism management: A case study in Taiwan. Journal of Travel and Tourism Marketing, 33(1), 43-58.7. Yang, Y., Lee, S. H., & Lee, S. H. (2015). An intelligent tourism recommendation system using sentiment analysis. Information Sciences, 325, 310-323.8. Yu, B., Zhang, J., & Guo, X. (2018). A personalized tourism recommendation system based on hybrid collaborative filtering algorithm. Journal of Hospitality and Tourism Management, 36, 1-12.9. Zhang, Y., Huang, W., & Zhang, X. (2017). A tourist behavior prediction and recommendation system based on big data. Journal of Hospitality and Tourism Technology, 8(4), 458-473.10. Zhou, L., Zhang, Y., & Wang, Y. (2015). An intelligent tourism recommendation system based on cloud computing. Journal of Hospitality and Tourism Technology, 6(1), 2-12.。
英语六级阅读 亚马逊商业推广经验
When Amazon recommends a product on its site,it is clearly not a coincidence.At root,the retail giant's recommendation system is based on a number of simple elements:what a user has bought in the past,which items they have in their virtual shopping cart,items they've rated and liked,and what other customers have viewed and purchased.Amazon(AMZN)calls this homegrown math"item-to-item collaborative filtering,"and it's used this algorithm to heavily customize the browsing experience for returning customers.A gadget enthusiast may find Amazon web pages heavy on device suggestions,while a new mother could see those same pages offering up baby products.Judging by Amazon's success,the recommendation system works.The company reported a29%sales increase to$12.83billion during its second fiscal quarter,up from $9.9billion during the same time last year.A lot of that growth arguably has to do with the way Amazon has integrated recommendations into nearly every part of the purchasing process from product discovery to checkout.Go to and you'll find multiple panes of product suggestions;navigate to a particular product page and you'll see areas plugging items"Frequently Bought Together"or other items customers also bought.The company remains tight-lipped about how effective recommendations are.("Our mission is to delight our customers by allowing them to serendipitously discover great products,"an Amazon spokesperson told Fortune."We believe this happens every single day and that's our biggest metric of success.")Amazon also doles out recommendations to users via email.Whereas the web site recommendation process is more automated,there remains to this day a large manual component.According to one employee,the company provides some staffers with numerous software tools to target customers based on purchasing and browsing behavior.But the actual targeting is done by the employees and not by machine.If an employee is tasked with promoting a movie to purchase like say,Captain America, they may think up similar film titles and make sure customers who have viewed other comic book action films receive an email encouraging them to check out Captain America in the future.Amazon employees study key engagement metrics like open rate,click rate,opt-out --all pretty standard for email marketing channels at any company--but lesser known is the fact that the company employs a survival-of-the-fittest-type revenue and mail metric to prioritize the Amazon email ecosystem."It's pretty cool. Basically,if a customer qualifies for both a Books mail and a Video Games mail, the email with a higher average revenue-per-mail-sent will win out,"this employee told Fortune."Now imagine that on a scale across every single product line--customers qualifying for dozens of emails,but only the most effective one reaches their inbox."The tactic prevents email inboxes from being flooded,at least by Amazon.At the same time it maximizes the purchase opportunity.In fact,the conversion rate and efficiency of such emails are"very high,"significantly more effective than on-site recommendations.According to Sucharita Mulpuru,a Forrester analyst,Amazon's conversion to sales of on-site recommendations could be as high as60%in some cases based off the performance of other e-commerce sites.Still,although Amazon recommendations are cited by many company observers as a killer feature,analysts believe there's a lot of room for growth."There's a collective belief within the e-commerce industry that Amazon's recommendation engine is a suboptimal solution,"says Mulpuru.Trisha Dill,a Well's Fargo analyst, says it's hard to fault Amazon for their recommendations,but she also says the company has a lot of work to do in offering users items more relevant to them.As an example,she points to a targeted email she received pushing a chainsaw carrying case.(She doesn't own a chainsaw.)Besides refining the accuracy of recommendations themselves,Amazon could explore more ways to reach customers.Already,the company has begun selling items previously sold in bulk that were too cost-prohibitive to ship individually like say,a deck of cards or a jar of cinnamon.Customers may buy them,but only if they have an order totaling$25or over.But the company could actively recommend these add-on products during check-out when an order crosses that pricing threshold,much like traditional supermarkets have impulse-purchase items like gum and candy bars at the register.At that point,the Amazon customer,just as they would in the supermarket,might think,"It's just a few more bucks.Why not?。
基于划分的Apriori 改进算法的网上商城推荐系统
作者姓名 工程领域
张志杰
学 校 指 导教 师 姓名 职 称
李青山 教授 陈玉彬 高工
软件工程 企 业 指 导教 师 姓名 职 称
提交论文日期
二〇一三年六月
西安电子科技大学 学位论文独创性(或创新性)声明
秉承学校严谨的学风和优良的科学道德,本人声明所呈交的论文是我个人在 导师指导下进行的研究工作及取得的研究成果。尽我所知,除了文中特别加以标 注和致谢中所罗列的内容以外,论文中不包含其他人已经发表或撰写过的研究成 果;也不包含为获得西安电子科技大学或其它教育机构的学位或证书而使用过的 材料。与我一同工作的同志对本研究所做的任何贡献均已在论文中做了明确的说 明并表示了谢意。 申请学位论文与资料若有不实之处,本人承担一切的法律责任。 本人签名: 日期:
代
Байду номын сангаас
号
10701 TP391
学 密 编
号 级 号
10102047 公开
分类号 U D C
题 ( 中 、 英 文 ) 目基于划分的 Apriori 改进算法的网上商城推荐系统
The online recommended shopping system based on a divided-and-improved Apriori algorithm
摘要
本研究主要针对应用在 B2C 网上商城上的个性化推荐系统, 探讨如何有效地 运用数据挖掘技术从大量的购物交易记录中挖掘出所有商品的关联规则集,以推 荐适当的商品信息给购物者,帮助他们在众多商品中找到他们需要、有用的商品 信息。 在分析 B2C 电子商务领域具有代表性的网上商城网站特点的基础上, 对不同 的推荐技术进行了详细的比较,着重分析了基于内容、基于用户信息、关联规则、 协同过滤这几种推荐技术对于 B2C 网上商城网站的适用性。 在此基础上提出了一 种新的自动化推荐系统的推荐机制,并详细介绍了该机制如何与数据挖掘技术相 结合应用到推荐程序中。为了使系统返回更好的推荐结果,本研究针对传统关联 规则挖掘算法的不足提出了一种基于分层的 Apriori 改进算法, 并用网上商城网站 实例验证了算法的有效性。最后本文分析了基于关联规则的个性化推荐系统在实 际使用过程中的不足,并提出了一些针对 B2C 网上商城网站的解决方案。
基于用户的个性化影视推荐系统的研究与实现
摘要摘要随着互联网的飞速发展,每天都有浩如烟海的信息产生,面对数据量庞大的信息海洋,人们往往会感到无所适从,因此,推荐系统应运而生。
推荐系统的目的是主动向用户提供其感兴趣的物品或资源而无需用户主动搜寻。
经过20多年的发展,推荐系统已经深入到了人们生活的方方面面,如电子商务,新闻推荐,影视推荐等。
其中影视推荐是推荐系统技术研究的重要领域。
现有的影视推荐主要是热门推荐和相关推荐,热门推荐容易导致马太效应,而相关推荐在一定程度上符合用户喜好,但是个性化程度较低,不同用户在同一个播放页上看到的推荐列表往往是相同的。
协作过滤算法是推荐领域中最成功也是应用最广泛的推荐策略,常用于个性化推荐。
本文在基于用户的协作过滤策略的基础上进行改进。
用户评分的高低表达了对电影的喜好程度,而用户的标注行为表达了用户的喜好倾向,两者结合可以有效提升推荐结果的个性化程度。
本文首先在用户行为数据建模阶段对用户的行为数据进行分析,将用户的评分行为和标注行为结合起来建立了初始的用户行为数据模型。
同时,考虑到用户喜好并不是一成不变,参考“牛顿冷却定律”引入了时间衰减因子模拟整个时间轴上的用户喜好变化,对用户行为数据模型进行偏移处理。
之后使用该模型进行用户之间的类似程度计算,获得推荐的电影资源候选池。
在电影资源的评分预测阶段,考虑到标签在一定程度上也反映了电影资源的内容特征信息,参考信息挖掘领域“词频-逆文档频率”的思想建立电影资源和标签之间的联系并对侯选池中的电影资源进行评分预测的改进。
然后对本文做出的改进设计了对比实验验证其有效性,选取了Top-N推荐中常用的评价标准命中率(Hit-rate)和命中排序(Hit-rank)作为衡量指标进行相关实验,验证了在推荐同等数量电影资源的情况下,改进后的算法Hit-rate和Hit-rank 都要高于现有的协作过滤算法。
本文在最后以前文提出的改进的推荐算法为基础设计并实现了一个影视推荐系统,首先分析了系统的需求,然后根据需求进行相关设计,并用SS2H框架实现了该系统,并给出了系统主要的数据表展示与功能界面展示。
大众点评POI与评论推荐-毕业论文
---文档均为word文档,下载后可直接编辑使用亦可打印---摘要随着互联网和移动通信迅猛发展,电子商务强势崛起,越来越多的人倾向于网上消费。
如何从海量的互联网数据中筛选出用户感兴趣的信息成为了全球互联网用户潜在的问题,推荐系统(Recommendation System)技术通过搜索大量动态生成的信息来为用户提供个性化的内容和服务来解决这个问题。
推荐系统作为一种信息过滤方式,试图预测用户的偏好兴趣和对物品的评价。
近年来,频繁活跃的互联网用户在消费信息的同时也产出了海量的原创内容。
本文的主要研究工作是深度挖掘用户原创的评论内容,分析出用户和物品的特征,进而进行评分预测。
评论(Comment)指人对于事物做出的客观叙述,反映了人的主观感受。
基于用户的文本评论数据,本文的主要研究工作如下:首先,从互联网上采集包含有用户、物品和用户文本评论的数据。
该数据集来源于大众点评网。
然后对评论文本进行分词,用词向量对其进行数学表达,形成主题词的分布表。
最后,基于用户文本用评论主题词进行评分预测,通过线性回归模型和改进的协同过滤算法预测评分,最终的实验结果表明,预测的评分客观准确,同时组合的预测算法效果更优。
关键词:推荐系统;用户评论;线性回归;评分预测AbstractWith the rapid development of the Internet and mobile communications, and the strong rise of e-commerce, more and more people tend to spend online.How to filter the information that users are interested in from the massive Internet data has become a potential problem for global Internet users. Recommendation systems solve this problem by searching through large volume of dynamically generated information to provide users with personalized content and services.The recommendation system serves as an information filtering method that attempts to predict the user's preference for interest and the evaluation of the item.In recent years, frequent and active Internet users have also produced massive amounts of original content while consuming information.The main research work of this paper is to deeply mine user-originated commentary content, analyze the characteristics of users and items, and then make score predictions.Comment reflects people’s subjective feelings. Based on the user's text review data, the main research work of this paper is as follows:First, data containing user, item, and user text reviews is collected from the Internet. This dataset comes from the Dianping’s website. Then, the comment text is segmented and mathematically expressed by the word vector.Then the text of the comment is segmented and expressed mathematically by the word vector to form the distribution table of the topic word.Finally, based on the user's comment, the scores are predicted by the subject headings, and the linear regression model and the improved collaborative filtering algorithm are used to predict the scores. The final experimental results show that the predicted scores are objective and accurate, and the combined rating prediction algorithm is more effective.Keywords: Recommendation System; Users’ Comment; Linear Regression; Rating Forecast前言进入互联网时代后,技术发展日新月异,人类获取信息的数量也急剧增长,从匮乏到当前的过载,信息的获取信息的方式也逐渐多样化。
resume英语作文
resume英语作文Resume。
Name: Li Ming。
Gender: Male。
Date of Birth: September 1st, 1995。
Nationality: Chinese。
Education:Bachelor’s degree in Computer Science, Shanghai Jiao Tong University, 2014-2018。
Master’s degree in Computer Science, Massachusetts Institute of Technology, 2018-2020。
Skills:Proficient in programming languages such as Java, Python, and C++。
Experienced in software development, including web development and mobile app development。
Familiar with machine learning and data analysis。
Strong problem-solving and analytical skills。
Experience:Software Engineer, Google, 2020-present。
Developed and maintained software systems for Google’s search engine. Collaborated with cross-functional teams to improve the user experience and optimize search algorithms.Software Development Intern, Microsoft, 2019。
Worked on a team developing a mobile app forMicrosoft’s cloud platform. Contributed to the design and implementation of the app’s user interface and backend.Research Assistant, MIT Computer Science and Artificial Intelligence Laboratory, 2018-2020。
个性化推荐系统的文献综述
个性化推荐系统在电子商务网站中的应用研究一、引言随着Internet的普及,信息爆炸时代接踵而至,海量的信息同时呈现,使用户难以从中发现自己感兴趣的部分,甚至也使得大量几乎无人问津的信息称为网络总的“暗信息”无法被一般用户获取。
同样,随着电子商务迅猛发展,网站在为用户提供越来越多选择的同时,其结构也变得更加复杂,用户经常会迷失在大量的商品信息空间中,无法顺利找到自己需要的商品。
个性化推荐,被认为是当前解决信息超载问题最有效的工具之一.推荐问题从根本上说就是从用户的角度出发,代替用户去评估其从未看过的产品,使用户不只是被动的网页浏览者,而成为主动参与者。
准确、高效的推荐系统可以挖掘用户的偏好和需求,从而成为发现用户潜在的消费倾向,为其提供个性化服务。
在日趋激烈的竞争环境下,个性化推荐系统已经不仅仅是一种商业营销手断,更重要的是可以增进用户的黏着性。
本文对文献的综述包括个性化推荐系统的概述、常用的个性化推荐系统算法分析以及个性化推荐系统能够为电子商务网站带来的价值。
二、个性化推荐系统概述个性化推荐系统是指根据用户的兴趣特点和购买行为,向用户推荐用户感兴趣的信息和商品。
它是建立在海量数据挖掘基础上的一种高级商务智能平台,以帮助电子商务网站为其顾客购物提供完全个性化的决策支持和信息服务。
购物网站的推荐系统为客户推荐商品,自动完成个性化选择商品的过程,满足客户的个性化需求,推荐基于:网站最热卖商品、客户所处城市、客户过去的购买行为和购买记录,推测客户将来可能的购买行为。
1995年3月,卡内基 梅隆大学的Robert Armstrong等人在美国人工智能协会首次提出了个性化导航系统Web-Watcher,斯坦福大学的Marko Balabanovic 等人在同一次会议上推出了个性化推荐系统LIRA。
同年8月,麻省理工学院的Henry Liberman在国际人工智能联合大会(IJCAI)上提出了个性化导航智能体Letizia。
The Framework of Network Public Opinion Monitoring and Analyzing System
The Framework of Network Public Opinion Monitoring and Analyzing System Based on Semantic Content IdentificationCheng Xian-Yi11,2, Zhu Ling-ling1,Zhu Qian2,Wang Jin11.School of Computer Science and Technology Nantong University, Nantong 226019, China2.School of Computer Science and Communications Engineering Jiangsu University,Zhenjiang 212013, ChinaE-MIAL: xycheng@doi:10.4156/jcit.vol5. issue10.7AbstractNetwork has become important public platform for the public to express the opinion, to discuss public affairs, to participate in economic social and political life. The spread of information network public opinion geometric progression growth, it is necessary to monitor and analyze network Public Opinion for the government to manage the public opinion information and to timely discover hot spots and to correctly guide public opinion trends. Therefore, Network Public Opinion monitoring and analyzing have become a hot issue in recent years. Now the main mature technology is the statistical analysis based on key word. However, there is still much room for improving its effectiveness. This paper describes a framework of network Public Opinion monitoring and analyzing system based on semantic content identification to solve some key problems of the public opinion.Keywords: Network Public Opinion, Natural Language Process, Semantic1. IntroductionIn recent years, with the development of Internet and the increasing number of netizens, some people disclosure and spread the sensitive and bad information through forums, IM, e-mail and so on, which threat the social stability and people's life and property. On the one hand, the national legislation and regulations should put forward higher attentions in the focuses of public opinion to server the public better; on the other hand, Government should takes on vital responsibilities in takes on vital responsibilities in correctly monitoring the sensitive public opinion and guiding them which protect network users from bad information and build a harmonious socialist country. According to the preliminary statistics data from Internet center, there have been large directly and indirectly losses since 1996, which we can be seen from Figure 1. Therefore, Network Public Opinion monitoring and analyzing have become an urgent and important issue [1].Where the Y Axis express State loss (in million dollars on units)Figure 1. The statistics of the harm to society by spreading the Network Public OpinionThe most important technologies about network Public Opinion analysis include text filtering, text classifying, clustering, viewpoint tendentiousness recognizing, tracking topics, automaticsummarizing and so on, which have been concerned about for a long time by domestic and foreign workers . In order to control information more effectively, This paper describes a framework of network Public Opinion monitoring and analyzing system based on semantic content identification.2. Research SituationResearchers from DARPA 、CMU、University of Massachusetts and Dragon Systems, Inc have began to define topic detection and tracking study and developed TDT. The important technology of this project is content classification of information, which resolves a contradiction between the processing speed and safety monitoring of the real-time monitoring and make it feasible. There are some studies about it abroad such as the PICS of the W3C which have become classification standard on WWW. There are two International general classification standards: SACi and Safesurf, which are both accord with the PICS. On the one hand, the classification technique is used for web page classification and filtering; on the other hand, the foreign policy and standards are not fully suitable for China's national conditions for various reasons.In China, Founder ZhiSi public opinion warning DSS [2]designed by Institute Founder is successful. The system has successfully achieved automatic real-time monitoring and analysis of the massive public opinion. It is more effective for government to monitor the public option than traditional manual mode .It also do some to strengthen the supervision of internet information and play a certain role for the network sudden public events .This DSS provide the function including such as full text retrieval, automatic sorting, automatic cluster, subject examination/tracing, related recommendation and disappear heavy, connection and tendency analysis, automatic abstract and key word extraction, thunderbolt analyzes, generate statistics and so on.Goonie network Public Opinion and information monitor system combines internet search technology; information intellectualized process technology and knowledge manage method. It realize network public opinion monitor and special news trace to briefing, report and so on through auto collect, auto classify combine, subject collection, focus special topic. Therefore Goonie can master public opinion, make proper consensus and provide report analyze [3].A framework of content security monitoring system was designed based on human-computer combination in literature [4]. The framework is a hierarchy, there are three levels: the data acquisition layer, content analysis, output layer. It’s function mainly examine the information based on content by the content analysis and identify the bad information; on the same time, it can provide electronic evidence for the bad use of information by recording the source and content of information and tracking them by effective audit analysis.Although there are many units engaged in domestic internet content filtering direction of the research, and try to achieve the purpose of purifying network environment. But these techniques are still in the bud, there are still some deficiencies in "the semantic information filtering"3. System FrameworkThe purpose of the system is to achieve a large-scale network environment monitoring report of Network Public Opinion through testing, acquirement, theme, hot topics and events tracking, experiments monitoring and so on, which can form many representation modes of analysis results, such as brief, reports, charts etc. Therefore, the system can master public opinion, make proper consensus and provide report analyze. Monitoring system for network Public Opinion module function block diagram in Figure 2.There are five stages including resource discovery, information selection, pattern discovery, information extraction, public opinion handling.Structure information knowledgeText analysis 、auditText purificationunstructure text structure textThememanagment warning 、filter 、counter 、monito r 、decision eventsearch report abstract Topicpattern Semantic index Event pattern Tendentious analysis filtere noise extracte thesubject expression luster filter (Digital Collection 、format shifting 、Data Import / Export ) textBBS blog E-mail Chattingroom WEB public opinionhandling informationextraction PatternDiscoveryselectinformationResourcesdiscoveryclientserverTopicsearch trendanalysis Figure 2. logic structure of the network public opinion monitoring system function moduleFigure 3 is the system workflow. The systems include the following five databases:1) Public opinion planning information database: To collect demand information of the Public opinion including the online news, BBS, RSS, chat room, blog, polymerization news (RSS), etc.2) Public opinion analysis information database: To collect the storage data through classification and clustering, keyword extraction, removal the duplication and filter, named entity recognition, semantic computing etc. which information database is structured.3) Public opinion database: To storage products related to public opinion analysis report, survey report, experience summary and related information.4) Semantic dictionary: Ontology knowledge, etc5) HNC knowledge: 466 sentence knowledge, etc [6].Public Sentiment Resources discovery select informationinformationextraction Pattern DiscoveryPublic opiniondatabaseprogramming Public opinion informationanalysis HNC knowledge Public opinionProducts semantic dictionaryPopular feelingshandling Figure 3. System workflow of the network public opinion monitoringFigure 4 is the client workflowShow topic listmodify thetheme ?Theme managementyes ThemeselectionnoServer-sideprocessing informationextraction popular feelings information handling user continue exitwarn decisionminor counter filter Topic retrieval eventretrieval Topic retrieval Summary of public opinion public reportFigure 4. Client workflowFigure 5 is data flow charts of the system. The interaction between the various modules are different: Data interaction is based on file between the resources discovery module and select information module; Select information module deal with information from the text to vector or ontology; Use GATE tagging to name entity in pattern discovery module and determine the relationship between entities and then discover the event pattern or the topic pattern; Information extraction module mainly do the semantic computing and transform the patterns into templates, which will make the unstructured information into structure information; Public opinion handling module need to carry on the inquiry according to the user and give these results to user with the suitable manifestation. simultaneously, the module receive the user’s establishment and inquiry request.Resources discovery Pattern Discovery select informationPopular feelingshandlinginformation extraction Network informationServer clientAnalysis of unstructured text Vector, ontology StructuredResults Databaseuser requests Results showFigure 5.data flow charts of the systemFigure 6 is an entire system network topology. The system can be a lot of users; each user can connect to a server. Servers can share data and exchange information each other by network, network connection scenarios can be P2P or client-server. Future which will be constantly modified and optimizedNetwork medialaw enforcementSpecial contentmanagementNetwork Communication Office AdministrativeExtranetFigure 6. network topology4. Working Process4.1. Resources discovery based on latent semantic analysisResources discovery, which retrieve the necessary network resources, is a process to integrate 、consolidate、mapping data by the different network information pattern. There are different retrieval tools and the strategies between the recourses.The BBS, chat room, the e-mail are short and random. First, use the DTS to Import / Export the document, and then eliminate the problem of the algorithm which ignores the environment and synonyms misjudgment based on the theme of latent semantic analysis, while using SVD to achieve the information filtering and noise removal purposes. We can find topics drift effectively and timely and meet the requirement of the public surveillance better according to the content of the document similarity calculation and clustering analysis.4.2. select informationSelect information is to achieve specialized information from the network by automatically selecting and pre-processing .First, filtered noise, recognize the named entity, extract the subject and the event ;Secondly, classification、luster、filter the text according to topics or events; Finally, discriminate the text.1) Text classification based on semi-supervised learningDistinctive feature of public opinion information is a short text, which should deal with massive data. The traditional text classification algorithm is a supervised learning, which to learn the calibration samples by the category tag settled and to determinate its category according to the text semantic content. It needs a large label samples trains to a good classifier. It’s easy to access the large number of unmarked data but to be high costs and impractical for marked data, which will create a bottleneck When dealing with huge amounts of data by the traditional text classification. We use text classification based on semi-supervised learning to overcome the sparsely of the short text and to improve the accuracy of short text classification algorithm. And in order to increase the robustness of the algorithm, better to avoid falling into local optimal solution; will integrate the Bagging algorithm integrated into semi-supervised learning.2) Bad information detectionBad information detection is one of the key factor of monitoring system about website content. It is only based on keywords on network information for recognizing and filtration for traditional network detection system. If you want to mask a number of cult sites, those who criticize the cult will aloes be filtered out. Therefore, we put forward a method to test poor information content based on HNC(figure 7),which are not by way of matching keywords and to judge what text information filtering needs according to the meaning of sentences.4.3. Pattern DiscoveryPattern discovery will achieve hot topic detection and concern about the incident tracking and orientation analysis by data mining and semantic computing based on the data from selecting information module. The module is core of the system.Pattern Discovery is presented as follows:1) First, we obtain four tables by using the ICTCLAS researched by the computer software of Chinese Academy of Sciences to achieve word segmentation and POS tagging:Theme Table (ID, title, text, author, time, vector)Comment table (ID, theme, title ID, text, author, time, tendentious value)Topics table (ID, Keywords group, the number of participants, time, polarity, viewpoint opposition, Notes)Topic - theme map table (topic ID, theme ID)A theme ID will be progressive distribution when inserting a database, we will keep on corresponding relations between comments and topic by the theme ID When saving comments information. Moreover, the third table holds basic clustering information; the fourth table holds the theme of each cluster contained which is the subject of the topics.Article PretreatSentenceAnalysisContext Generationshort-termmemory Position to judge thesemantic Red and blackcheck Network map object positionHNC 概念知识库、词语知识库HNC Judgments Semantic Library Red and black objects LibraryElements of theframework textSentencesemanticstructureText nature: 1 absolutely black, 2 absolutely red, 3black, 4 suspicious III,5 suspicious II,6 suspicious 1,7neutralitiesFigure 7. bad information detection algorithm diagrams2) Tendency AnalysisFirst we get ready for tendency dictionary to achieve first dictionary based on marked polarity and strengthen by artificial labeled method in How-Net, and then manually add some common words. We should establish a good tendency dictionary using hash table provided by Java language because that there need to quickly check the inclination.Next, read the text, process it sentence by sentence., remove the stop-words for each sentence, query tendency dictionary word by word, calculate its context polarity and strength for the polarity of the words. Then, add up all the polar components, receive sentences density situation divided by the square root of the number of commentsFinally, represent with tendency value according to distributed situation division commentary tendency and rank.3) Popular feelings key point analysis: Query the comments from the database according to the topic - theme map table and rank the basis for hot spot.Calculate incident concern by the opposition of the topic view combined with its comments. select an initial point based on the basic accumulation unit on a unit of time (for example: days), And then calculate the time point of the topic view by the opposition of counting only the comments before the point in time, latter, opinion in opposition to the added value of this time are received by subtracting the value of a point in time to previous time value .that the trend of events can be obtained4.4. information extractionThis module mainly gets structure data and obtains several databases for analysis and confirm or esplanade the mode mined out. we can use GATE [6]: entity recognition, entity-relationship recognition, events recognition, summary generation, etc.4.5. popular feelings information handling1) The warningWarning module of public opinion collects network information; discover the problem (things) and feedback.Warning is active at a given time period show with the theme related events, the topic of the trend.2) FilteringFiltering is just too bad information. The network management gets rid of negative news by monitoring at all times. Collect sensitive phrase from different fields and set a weight value for each phrase and use intelligent software to find sensitive phrase matching according to weights. The information will be shielded beyond a certain threshold established.3) CounterFirst, gain its IP, and then lock it. We can use each effective attack method to carry on the fixed-point attack disseminate for unsafe information of Hub the website (for example information seepage technology, viral technology, advanced hacker attack technology and so on).It can prevent the unsafe information from spreading and countering.4) MonitoringThe system lists all the events or topics about the subject after entering the start time of monitoring, the users select the suspected event or topic, monitoring module will continuously monitor. Monitoring and early warning is different that the former is passive surveillance, early warning is active.5) DecisionA complete decision-making is often not possible, but an iterative process. In this process, human-computer interaction can be used by policy makers in the parameters of different options and alternatives.5. ConclusionsThere are heavy workloads for the traditional machine learning methods which need to be manually tagging train classifiers netizens. This paper application content identification technology based on semantics to design a framework of analysis and monitoring network Public Opinion system for the comment being relatively short and broad emotional vocabulary. The next step we will pass the experiments to show that the system can achieve a more satisfactory result.6. References[1] Li Yonghao. Simulation and analysis of Rapid screening algorithms about network hot topics.Computer Communication Laboratory, Beijing Jiaotong University, internal communication documents. 2006.14-16[2] Founder Technology Research Institute. Public opinion on science and technology means tosupport network monitoring and analysis of unexpected events - Founder ZhiSi public opinion warning DSS. Informatization. 2005:50-52[3] /news/industrynews/2008/05/2008-05-03122.html[4]Li Yanling. Security monitoring system framework and its key technology of BBS content.Research Institute of China Electronics. 2007,2(4):144-149[5] Jin Yaohong. HNC language understanding technology and its applications [M]. Beijing: SciencePress. 2006.[6]/。
基于认证的移动网络中的信任模型——英文翻译
Certification-based trust models in mobile ad hoc networks:A survey and taxonomyMawloud Omar,nUniversite A/Mira,ReSyD,Bejaia,AlgeriaYachne Challal,Abdelmadjid BouabdallahUniversite de Technologie de Compiegne,Heudiasyc-UMR CNRS 6599,Compiegne,France AbstractA mobile ad hoc network is a wireless communication network which does not rely on a pre-existing infrastructure or any centralized management. Securing the exchanges in such network is compulsory to guarantee a widespread development of services for this kind of networks. The deployment of any security policy requires the definition of a trust model that defines who trusts who and how. There is a host of research efforts in trust models framework to securing mobile ad hoc networks. The majority of well-known approaches is based on public-key certificates,and gave birth to miscellaneous trust models ranging from centralized models to web-of-trust and distributed certificate authorities. In this paper,we survey and classify the existing trust models that are based on public-key certificates proposed for mobile ad hoc networks,and then we discuss and compare them with respect to some relevant criteria. Also,we have developed analysis and comparison among trust models using stochastic Petri nets in order to measure the performance of each one with what relates to the certification service availability.Keywords: mobile ad hoc network,trust models,certificates1. IntroductionMobile ad hoc networking is emerging as an important area for new developments in the field of wireless communication. The premise of forming a mobile ad hoc network is to provide wireless communication between heterogeneous devices,anytime and anywhere,with no infrastructure. These devices,such as cell phones,laptops,palmtops,etc. carry out communication with other nodes that come in their radio range of connectivity. Each participating node provides services such as message forwarding,providing routing information,authentication,etc. to form a network with other nodes spread over an area. With the proliferation of mobile computing,mobile ad hoc networking is predicted to be a key technology for the next generation of wireless communications. They are mostly desired in military applications where their mobility is attractive,but have also a high potential for use in civilian applications such as coordinating rescue operations in infrastructure-less areas ,sharing content and network gaming in intelligent transportation systems,surveillance and control using wireless sensor networks,etc.Inherent vulnerability of mobile ad hoc networks introduces new security problems,which are generally more prone to physical security threats. The possibility of eavesdropping,spoofing,denial-of-service,and impersonation attacks increases. Similar to fixed networks,security of mobile ad hoc networks is considered from different points such as availability,confidentiality,integrity,authentication,non repudiation,access control and usage control. However,security approaches used to protect the fixed networks are not feasible due to the salient characteristics of mobile ad hoc networks. New threats,such as attacks raised from internal malicious nodes,are hard to defend. The deployment of any security service requires the definition of a trust model that defines who trusts who and how. There are recent research efforts in trust models framework to securing mobile ad hoc networks. There exist two main approaches: (1) cooperation enforcement trust models,and,(2) certification- based trust models. In Table 1,we present the major differences between cooperation enforcement trust models and certification-based trust models.Table 1Cooperation enforcement vs. certification-based trust modelsThe first trust models category is based basically on reputation among nodes. The reputation of a node increases when it carries out correctly the tasks of route construction and data forwarding. The models of this category support effective mechanisms to measure the reputation of other nodes of the network. They also incorporate techniques that isolate the misbehaving nodes that are those that show a low reputation value. Trust models based on cooperation enforcement are well surveyed in the literature. Marias et al. provided such a thorough survey of cooperation enforcement trust models. In this paper,we are interested in the category of certification-based trust models. Indeed,in this category,the trust relationship among users is performed in a transitive manner,such that if A trusts B,and B trusts C,then A can trust C. In this relationship,the principal B is called Trusted Third Party (TTP). The latter could be a central authority (like CA –Certification Authority) or a simple intermediate user. Both points of view gave birth to two categories of models: (a) Authoritarian models,and (b) Anarchic models. In this paper,we review and classify the existing certification-based trust models belonging to each category. Moreover,to determine the efficiency of a given trust model,it is very important to estimate the certification service availability with respect to mobile ad hoc networks configuration. Therefore,we have modeled the certification process of each surveyed trust model using stochastic Petri nets (SPN). As you will see in the following sections,this allows a better understanding of the performances of the different models and how to leverage some parameters forhigher certification service availability.While a number of surveys covering the issues of key management in mobile ad hoc networks,have provided some insightful overviews of the different schemes proposed in the literature,none of them focuses on issues related to certificates management thoroughly (the scheme architecture,how the certificates are stored and managed,the complexity evaluation of the certification protocol,etc.). To complement those efforts,this work provides detailed taxonomy of certification-based trust models,and illustrates in depth the different schemes by providing the advantages and drawbacks of each one with respect to relevant criteria. The careful examination and analysis has allowed us to carry out a comparative study of the proposed schemes based on an analytic evaluation. The ultimate goal of this paper is to identify the strengths and weaknesses of each scheme in order to devise a more effective and practical certificate-based trust models which can achieve a better trade-off between security and performance.The remaining of this paper is structured as follows. In Section 2,we recall background material relating to basic concepts on cryptography and threshold cryptography. Then,in Section 3,we identify requirements relating to certificates management with respect to mobile ad hoc networks environment and constraints,and in Section 4 we propose a tax on o my of the existing certification-based trust models. Respectively,in Sections 5 and 6,we review the authoritarian models,and anarchic models. For each solution,we provide a brief description and discuss its advantages and short- comings. We model the different solutions using stochastic Petri nets and provide analytical results and conclusions. Then,we make a general analysis and comparison against some important performance criteria. We finally conclude this paper in Section 7 with the sender. Each public-key is published,and the corresponding private-key is kept secret by the sender. Message encrypted with the sender’s public-key can be decrypted only wit h the sender’s private-key. In general,to send encrypted message to someone,the sender encrypts the message with that receiver’s public-key,and the receiver decrypts it with the corresponding private-key authentication is a service related to identification. This function applies to both entities and information itself. Two parties entering into a communication should identify each other.The public-key certificate is a digital data structure issued by a trusted third party to certify a public-key’s ownership. Among other information a public-key certificate contains: (1) certificate number; (2) issuer’s identity; (3) owner’s identity;(4) owner’s public-key; (5) signature algorithm; (6) period of validity; and (7) the issuer’s signature,and eventually other extensions. CA (Certification Authority) is a trusted third party,which is usually a trustworthy entity for issuing certificates. If the same CA certifies two users,then they would have the same CA in common as a third trust party. The two users would then use the CA’s public-key to verify their exchanged certificates in order to authenticate the included public-keys and use them for identification and secure communication. Each CA might also certify public-keys of other CAs,and collectively forms a hierarchical structure. If different CAs certification two users,they must resort to higher-level CAs until they reach a common CA (cf. Fig. 1).Web-of-trust model does not use CAs. Instead,every entity certifies the binding of identities and public- keys for other entities. For example,an entity u might think it has good knowledge of an entity v and is willing to sign’s public-key certificate. All the certificates issued in the system forms a graph of certificates,named web-of-trust (cf. Fig. 2).2. BackgroundIn this section we recall the definition of some security services using cryptographic mechanisms.2.1. Security services and basic cryptography mechanismsConfidentiality is a service used to keep the content of information from all,but those authorized to have it. Confidentiality is guaranteed using encryption. Encryption is a cryptographic transformation of the message into a form that conceals the message original meaning to prevent it from being known or used. If the transformation is reversible,the corresponding reversal process is called decryption,which is a transformation that restores the encrypted message to its original state. With most modern cryptography,the ability to keep encrypted information secret is based not on the cryptographic encryption algorithm,which is widely known,but on a piece of information called a key that must be used with the algorithm to produce an encrypted result or to decrypt previously encrypted information. Depending on whether the same or different keys are used to encrypt and to decrypt the information We distinguish between two types of encryption systems used to assure confidentiality: Symmetric-key encryption: a secret key is shared between the sender and the receiver and it is used to encrypt the message by the sender and to decrypt itby the receiver. The encryption of the message produces a non-intelligible piece of information; the decryption reproduces the original message. Public-key encryption: also called asymmetric encryption,involves a pair of keys (public and private keys)3. Design issuesThe distribution of public-keys and management of certificates have been widely studied in the case of infrastructure-based networks. In the latter,several issues have been well discussed. However,the certificates management in mobile ad hoc networks addresses additional new issues appeared from the constraints imposed,in particular,by the ad hoc network environment. These issues can be resumed in the following points:Certification service availability issue: In mobile ad hoc networks,due to the frequent link failures,nodes mobility,and limited wireless medium,it is typically not feasible to maintain a fixed centralized authority in the network. Further,in networks requiring high security,such a server could become a single point of failure. One of the primary requirements is to distribute the certification service amongst a set of special nodes (or all nodes) in the network.Resources consumption issue: Since the nodes in mobile ad hoc network typically run on batteries with high power consumption and low memory capacity,the certification service must be resource-aware. That means the time and space complexity of the underlying protocols must be acceptably low in terms of computation,communication,and storage overheads.Scalability issue: Many applications in mobile ad hoc networks involve a large number of nodes. When the certificates management is handled through a centralized authority,the latter may become overloaded due to the number of nodes request. Otherwise,if the certification service is designed in a fully distributed way among several nodes in the network,each participant to the service must maintain a local repository,which contains a maximum number of certificates concerning the other nodes in the network. Hence,the storage overhead will be linear to the network size,which may compromise the system scalability to large ad hoc networks.Handling heterogeneity issue: As in the case of wired networks,the certifying authorities might be heterogeneous even in mobile ad hoc networks. This means that two or more nodes belonging to different domains (mainly in term of certification policy) may try to authenticate each other. In such a case,there must be some kind of trust relationship between the two domains.4. TaxonomyIn Fig. 4,we propose a tax on o my of the existing certification-based trust models for mobile ad hoc networks. We divide existing solutions into two categories depending on the existence or not of central authorities.4.1 Authoritarian modelsIn this category,there exist one or more authorities that are trusted by the whole community of ad hoc nodes. Depending on the number of authorities,this category can be further divided into monopolist models and oligopolist models:1.Monopolist models. In this subcategory,the system is ensured by acertification authority. To cope with the spontaneous nature of mobile ad hoc networks,the service is distributed among several servers,which ensure collectively the CA’s role using a (k,n) threshold cryptography scheme. The CA’s private key is divided into n private-shares,such that each server holds one private-share. In order to deliver a certificate to a given client node,each server creates a partial certificate (certificate signed using a private-share). The system processes the client request,such that the combination of any k partial certificates gives as a result a valid certificate signed by the CA’s private-key.This subcategory is divided into:(a) Single distributed CA,where the certification service,in the whole system,is ensured by only one CA,which is distributed among several servers.(b) Hierarchical CAs,where the certification service is ensured by several homogeneous CAs organized into a hierarchy. Each or some CAs in the system is distributed among several servers. A trust relationship should be established among the different CAs in this case.2.Oligopolist models. In this subcategory,the system is composed ofseveral heterogeneous CAs. Each CA has its own policy of certification. Each or some CAs in the system are distributed among several servers.4.2. Anarchic modelsIn this category of models,there is no central authority. Or in other words,each user acts as an authority independently of other users in the network. The propagation of trust in the network forms what is commonly called web-of-trust. As previously outlined,the web-of-trust is managed by users themselves. This model isdecentralized in nature,and so very adequate for mobile ad hoc networks. In this category of trust models,two main operations are addressed: (1) the initial web-of-trust construction and (2) the certificates chain discovery. This subcategory can be further divided into proactive models and reactive models:1. Proactive models. In this subcategory,the protocol of certificates collection is executed systematically among neighboring nodes. Thus,when the node needs to verify a certificate,it is done instantly since the required chain of certificates would have been already retrieved from the network.2. Reactive models. In this subcategory,the certificates collection protocol is executed on-demand. When the node needs to verify a certificate,it collects in a distributed manner the appropriate chain of certificates from the network. This prolongs the delays of certificates verification.In the following sections,we give detailed descriptions of certification-based trust models belonging to each category. We give for each trust model an overview,advantages,drawbacks,and eventually the proposed extensions. Then,for each category,we give an analytical modeling and an overall comparison with respect to the criteria presented in Section 3.5. Authoritarian modelsIn this section we present and discuss certification-based trust models belonging to the authoritarian models category.5.1. Monopolist modelsIn this class of trust models,the certification service is ensured by a single or several homogeneous CA.5.2. Oligopolist modelsIn this class of trust models,the certification service is composed of several heterogeneous CAs,which each one has its own policy of certification.5.3. Modeling and discussionIn order to measure the degree of the possibility to get a successful certification process,we have opted to model trust models using SPN (Stochastic Petri Network). This model is adequate in the sense that the availability of servers at a given moment for a given node requester is probabilistic and depends on many parameters such as mobility,nodes availability,radio links failure,etc. Then,the servers must collaborate collectively to generate a public-key certificate which requires the synchronization of at least k servers. Indeed,SPNs consist of places and transitions as well as a number of functions. Enabled transitions fire according to exponential distributions; characteristic of Markov processes. It allows the quick construction of a simplified abstract model that is numerically solved for different model parameters. In Fig. 10,we present SPNs corresponding to each trust model belonging to this category,and we note in Table 3 the most used terminology in this subsection.Description……7. ConclusionsIn this paper we focused on certification-based trust models in mobile ad hocnetworks. We provided an overview of the objectives and requirements relating to certificates managements with respect to mobile ad hoc networks environments: service availability,resources awareness,scalability,and handling the heterogeneity. We have classified existing solutions into two approaches: (1) Authoritarian models,where the certification service is provided through one or several certification authorities. In order to take into consideration the above-mentioned requirements,and especially availability and resources awareness,the certification service is distributed among a set of special nodes cooperation to provide the service through threshold cryptography. (2) Anarchic models,where each user in the network considers itself as a certification authority and establishes its own trust relationships according to some rules that may require the cooperation of other users in the network. Again,to take into consideration the above-mentioned requirements,some techniques are used to make certificates chain verification fasterwith low certificates storage overhead. We have further divided these two categoriesinto fine grained sub-categories to illustrate the different organizational and performance aspects of the proposed solutions in the literature. We believe that the proposed taxonomy provides a global and precise insight over existing solutions,with a better understanding of the design choices decided by their authors.In order to measure the service availability degree,we have modeled the reviewed certification-based trust models using SPNs(Stochastic Petri Nets),followed by comparisons and analytical discussions of each trust model. We have showed,in the authoritarian models,that there are two criteria that influence on the certification system availability. The first criterion is the coalition of servers providing the certification service: how to choose the servers? And how many servers can be available to respond to a certification requests? The second criterion is the choice of the threshold value (k). We have studied the impact of these two parameters on the successful certification rate of the existing trust models. This allowed us to further categorize the solutions into performance classes depending on the variation of these parameters dictated by the design of each trust model. In the other category of anarchic models,we have showed that there are two significant criteria that influence on the authentication service availability. The first criterion relates to the management of certificates repository servers,and especially their availability to respond to client nodes requests. The second criterion is the policy nature of certificates chain recovery,and especially,the induced length of certificates chain requiring verification during the certification process. We have then studied the impact of these parameters on the rate of successful service of authentication. This culminated to the categorization of existing solutions into performance classes depending on the design of each trust model.This survey should help shed some light on certification-based trust models in mobile ad hoc networks. It should be especially useful to get a global and precise insight of existing solutions through a fine grained taxonomy and a thorough performance modeling,evaluation and comparison.Journal of Network and Computer Applications2011 Elsevier Ltd.中文译文基于认证的移动网络中的信任模型:调查及分类奥玛拉.马洛德阿尔及利亚倍及亚热赛德巴黎米拉大学亚森.查拉,阿伯丁伊德德·堪培根科技大学,法国国家科学研究院摘要:移动网是一种无线通信网络,不依赖于已有的基础设施或任何的集中管理。
基于数据挖掘的选课推荐系统设计与实现
摘 要 在高校管理系统中,学生信息数据量众多,但对信息的利用率低,无法为学生提供完善的课程推荐服务。
提出利用数据挖掘技术构建学生个性化的选课推荐系统,首先,分析学生行为特征,提取学生的个性特征并构建学生的用户画像;其次,根据Apriori 算法对课程信息进行关联分析,挖掘课程之间的关联性,优化选课推荐集。
通过个性化推荐选课服务,促进学生个性化学习,使学生更好地利用学校资源。
关键词 数据挖掘;选课推荐系统;用户画像;关联规则;Apriori 算法中图分类号:G642 文献标识码:B 文章编号:1671-489X(2020)16-0012-03Design of Course Selection Recommendation System based on Data Mining //FENG Chusheng, DU XiaomingAbstract For today’s college management systems, there is a lot of student information data, but the interest rate of the school for information is low, and it cannot provide students with comprehen -sive student course management and course recommendation ser -vices. This paper proposes to use data mining technology to con -struct a personalized recommendation system for students. We ana -lyze the behavioral characteristics of students, extract the personality characteristics of students and construct student portraits of students, recommend courses based on the characteristics of student portraits, and then use the Apriori algorithm to conduct course information. Association analysis, mining the correlation between courses, and optimizing the set of recommended courses. Through personalized recommendation course selection service, students’ personalized learning can be improved, students’ learning dynamics can be under-stood, students can make better use of school resources, further im-prove the school’s teaching services, and improve the school’s tea-ching quality.Key words data mining; course selection recommendation system; association rules; Apriori algorithm1 引言10.3969/j.issn.1671-489X.2020.16.012基于数据挖掘的选课推荐系统设计与实现*◆冯楚生 杜晓明随着信息技术的发展,数据产生的渠道也在迅猛增多,随之而来的数据库中包含的数据量也在呈指数增加趋势,从收集到的数据中找到有用信息的方法就变得尤为重要。
Toward the Next Generation of Recommender Systems A Survey of the State-of-the-Art and Possible Exte
Toward the Next Generation of Recommender Systems:A Survey of the State-of-the-Art andPossible ExtensionsGediminas Adomavicius,Member,IEEE,and Alexander Tuzhilin,Member,IEEE Abstract—This paper presents an overview of the field of recommender systems and describes the current generation ofrecommendation methods that are usually classified into the following three main categories:content-based,collaborative,and hybrid recommendation approaches.This paper also describes various limitations of current recommendation methods and discussespossible extensions that can improve recommendation capabilities and make recommender systems applicable to an even broader range of applications.These extensions include,among others,an improvement of understanding of users and items,incorporation of the contextual information into the recommendation process,support for multcriteria ratings,and a provision of more flexible and less intrusive types of recommendations.Index Terms—Recommender systems,collaborative filtering,rating estimation methods,extensions to recommender systems.æ1I NTRODUCTIONR ECOMMENDER systems have become an important research area since the appearance of the first papers on collaborative filtering in the mid-1990s[45],[86],[97]. There has been much work done both in the industry and academia on developing new approaches to recommender systems over the last decade.The interest in this area still remains high because it constitutes a problem-rich research area and because of the abundance of practical applications that help users to deal with information overload and provide personalized recommendations, content,and services to them.Examples of such applica-tions include recommending books,CDs,and other products at [61],movies by MovieLens [67],and news at VERSIFI Technologies(formerly )[14].Moreover,some of the vendors have incorporated recommendation capabilities into their commerce servers[78].However,despite all of these advances,the current generation of recommender systems still requires further improvements to make recommendation methods more effective and applicable to an even broader range of real-life applications,including recommending vacations,certain types of financial services to investors,and products to purchase in a store made by a“smart”shopping cart[106]. These improvements include better methods for represent-ing user behavior and the information about the items to be recommended,more advanced recommendation modeling methods,incorporation of various contextual information into the recommendation process,utilization of multcriteria ratings,development of less intrusive and more flexible recommendation methods that also rely on the measures that more effectively determine performance of recommen-der systems.In this paper,we describe various ways to extend the capabilities of recommender systems.However,before doing this,we first present a comprehensive survey of the state-of-the-art in recommender systems in Section2.Then, we identify various limitations of the current generation of recommendation methods and discuss some initial ap-proaches to extending their capabilities in Section3.2T HE S URVEY OF R ECOMMENDER S YSTEMS Although the roots of recommender systems can be traced back to the extensive work in cognitive science[87], approximation theory[81],information retrieval[89], forecasting theories[6],and also have links to management science[71]and to consumer choice modeling in marketing [60],recommender systems emerged as an independent research area in the mid-1990s when researchers started focusing on recommendation problems that explicitly rely on the ratings structure.In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user.Intuitively,this estimation is usually based on the ratings given by this user to other items and on some other information that will be formally described below.Once we can estimate ratings for the yet unrated items,we can recommend to the user the item(s)with the highest estimated rating(s).More formally,the recommendation problem can be formulated as follows:Let C be the set of all users and let S be the set of all possible items that can be recommended, such as books,movies,or restaurants.The space S of.G.Adomavicius is with the Carlson School of Management,University ofMinnesota,32119th Avenue South,Minneapolis,MN55455.E-mail:gedas@.. A.Tuzhilin is with the Stern School of Business,New York University,44West4th Street,New York,NY10012.E-mail:atuzhili@.Manuscript received8Mar.2004;revised14Oct.2004;accepted10Dec.2004;published online20Apr.2005.For information on obtaining reprints of this article,please send e-mail to:tkde@,and reference IEEECS Log Number TKDE-0071-0304.1041-4347/05/$20.00ß2005IEEE Published by the IEEE Computer Societypossible items can be very large,ranging in hundreds of thousands or even millions of items in some applications,such as recommending books or CDs.Similarly,the user space can also be very large—millions in some cases.Let u be a utility function that measures the usefulness of item s to user c ,i.e.,u :C ÂS !R ,where R is a totally ordered set (e.g.,nonnegative integers or real numbers within a certain range).Then,for each user c 2C ,we want to choose such item s 02S that maximizes the user’s utility.More formally:8c 2C;s 0c ¼arg max s 2Su ðc;s Þ:ð1ÞIn recommender systems,the utility of an item is usually represented by a rating ,which indicates how a particular user liked a particular item,e.g.,John Doe gave the movie “Harry Potter”the rating of 7(out of 10).However,as indicated earlier,in general,utility can be an arbitrary function,including a profit function.Depending on the application,utility u can either be specified by the user,as is often done for the user-defined ratings,or is computed by the application,as can be the case for a profit-based utility function.Each element of the user space C can be defined with a profile that includes various user characteristics,such as age,gender,income,marital status,etc.In the simplest case,the profile can contain only a single (unique)element,such as User ID.Similarly,each element of the item space S is defined with a set of characteristics.For example,in a movie recommendation application,where S is a collection of movies,each movie can be represented not only by its ID,but also by its title,genre,director,year of release,leading actors,etc.The central problem of recommender systems lies in that utility u is usually not defined on the whole C ÂS space,but only on some subset of it.This means u needs to be extrapolated to the whole space C ÂS .In recommender systems,utility is typically represented by ratings and is initially defined only on the items previously rated by the users.For example,in a movie recommendation application (such as the one at ),users initially rate some subset of movies that they have already seen.An example of a user-item rating matrix for a movie recommendation application is presented in Table 1,where ratings are specified on the scale of 1to 5.The “ ”symbol for some of the ratings in Table 1means that the users have not rated the corresponding movies.Therefore,the recommendation engine should be able to estimate (predict)the ratings of the nonrated movie/user combinations and issue appropriate recommendations based on these predictions.Extrapolations from known to unknown ratings are usually done by 1)specifying heuristics that define the utility function and empirically validating its performanceand 2)estimating the utility function that optimizes certain performance criterion,such as the mean square error.Once the unknown ratings are estimated,actual recommendations of an item to a user are made by selecting the highest rating among all the estimated ratings for that user,according to (1).Alternatively,we can recommend the N best items to a user or a set of users to an item.The new ratings of the not-yet-rated items can be estimated in many different ways using methods from machine learning,approximation theory,and various heuristics.Recommender systems are usually classified according to their approach to rating estimation and,in the next section,we will present such a classification that was proposed in the literature and will provide a survey of different types of recommender systems.The commonly accepted formulation of the recommendation problem was first stated in [45],[86],[97]and this problem has been studied extensively since then.Moreover,recommender systems are usually classified into the following categories,based on how recommendations are made [8]:.Content-based recommendations :The user will be recommended items similar to the ones the user preferred in the past;.Collaborative recommendations :The user will berecommended items that people with similar tastes and preferences liked in the past;.Hybrid approaches :These methods combine colla-borative and content-based methods.In addition to recommender systems that predict the absolute values of ratings that individual users would give to the yet unseen items (as discussed above),there has been work done on preference-based filtering ,i.e.,predicting the relative preferences of users [22],[35],[51],[52].For example,in a movie recommendation application,prefer-ence-based filtering techniques would focus on predicting the correct relative order of the movies,rather than their individual ratings.However,this paper focuses primarily on rating-based recommenders since it constitutes the most popular approach to recommender systems.2.1Content-Based MethodsIn content-based recommendation methods,the utility u ðc;s Þof item s for user c is estimated based on the utilities u ðc;s i Þassigned by user c to items s i 2S that are “similar”to item s .For example,in a movie recommendation application,in order to recommend movies to user c ,the content-based recommender system tries to understand the commonalities among the movies user c has rated highly in the past (specific actors,directors,genres,subject matter,TABLE 1A Fragment of a Rating Matrix for a Movie Recommender Systemetc.).Then,only the movies that have a high degree of similarity to whatever the user’s preferences are would be recommended.The content-based approach to recommendation has its roots in information retrieval[7],[89]and information filtering[10]research.Because of the significant and early advancements made by the information retrieval and filtering communities and because of the importance of several text-based applications,many current content-based systems focus on recommending items containing textual information,such as documents,Web sites(URLs),and Usenet news messages.The improvement over the tradi-tional information retrieval approaches comes from the use of user profiles that contain information about users’tastes, preferences,and needs.The profiling information can be elicited from users explicitly,e.g.,through questionnaires, or implicitly—learned from their transactional behavior over time.More formally,let ContentðsÞbe an item profile,i.e.,a set of attributes characterizing item s.It is usually computed by extracting a set of features from item s(its content)and is used to determine the appropriateness of the item for recommendation purposes.Since,as mentioned earlier, content-based systems are designed mostly to recommend text-based items,the content in these systems is usually described with keywords.For example,a content-based component of the Fab system[8],which recommends Web pages to users,represents Web page content with the 100most important words.Similarly,the Syskill&Webert system[77]represents documents with the128most informative words.The“importance”(or“informative-ness”)of word k j in document d j is determined with some weighting measure w ij that can be defined in several different ways.One of the best-known measures for specifying keyword weights in Information Retrieval is the term frequency/inverse document frequency(TF-IDF)measure[89]that is defined as follows:Assume that N is the total number of documents that can be recommended to users and that keyword k j appears in n i of them.Moreover,assume that f i;j is the number of times keyword k i appears in document d j.Then, T F i;j,the term frequency(or normalized frequency)of keyword k i in document d j,is defined asT F i;j¼f i;jmax z f z;j;ð2Þwhere the maximum is computed over the frequencies f z;j of all keywords k z that appear in the document d j. However,keywords that appear in many documents are not useful in distinguishing between a relevant document and a nonrelevant one.Therefore,the measure of inverse document frequencyðIDF iÞis often used in combination with simple term frequencyðT F i;jÞ.The inverse document frequency for keyword k i is usually defined asIDF i¼log Nn i:ð3ÞThen,the TF-IDF weight for keyword k i in document d j is defined asw i;j¼T F i;jÂIDF ið4Þand the content of document d j is defined asContentðd jÞ¼ðw1j;...w kjÞ:As stated earlier,content-based systems recommend items similar to those that a user liked in the past[56],[69], [77].In particular,various candidate items are compared with items previously rated by the user and the best-matching item(s)are recommended.More formally,let ContentBasedP rofileðcÞbe the profile of user c containing tastes and preferences of this user.These profiles are obtained by analyzing the content of the items previously seen and rated by the user and are usually constructed using keyword analysis techniques from information retrieval.For example,ContentBasedP rofileðcÞcan be defined as a vector of weightsðw c1;...;w ckÞ,where each weight w ci denotes the importance of keyword k i to user c and can be computed from individually rated content vectors using a variety of techniques.For example,some averaging approach,such as Rocchio algorithm[85],can be used to compute ContentBasedP rofileðcÞas an“average”vector from an individual content vectors[8],[56].On the other hand,[77]uses a Bayesian classifier in order to estimate the probability that a document is liked.The Winnow algorithm[62]has also been shown to work well for this purpose,especially in the situations where there are many possible features[76].In content-based systems,the utility function uðc;sÞis usually defined as:uðc;sÞ¼scoreðContentBasedP rofileðcÞ;ContentðsÞÞ:ð5ÞUsing the above-mentioned information retrieval-based paradigm of recommending Web pages,Web site URLs, or Usenet news messages,both ContentBasedP rofileðcÞof user c and ContentðsÞof document s can be represented as TF-IDF vectors~w c and~w s of keyword weights.Moreover, utility function uðc;sÞis usually represented in the information retrieval literature by some scoring heuristic defined in terms of vectors~w c and~w s,such as the cosine similarity measure[7],[89]:uðc;sÞ¼cosð~w c;~w sÞ¼~w cÁ~w sjj~w c jj2Âjj~w s jj2¼P Ki¼1w i;c w i;sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiP Ki¼1w2i;cqffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiP Ki¼1w2i;sq;ð6Þwhere K is the total number of keywords in the system.For example,if user c reads many online articles on the topic of bioinformatics,then content-based recommenda-tion techniques will be able to recommend other bioinfor-matics articles to user c.This is the case because these articles will have more bioinformatics-related terms(e.g.,“genome,”“sequencing,”“proteomics”)than articles on other topics and,therefore,ContentBasedP rofileðcÞ,as defined by vector~w c,will represent such terms k i with high weights w ic.Consequently,a recommender system using the cosine or a related similarity measure will assign higher utility uðc;sÞto those articles s that have high-weighted bioinformatics terms in~w s and lower utility to the ones where bioinformatics terms are weighted less.Besides the traditional heuristics that are based mostly on information retrieval methods,other techniques for content-based recommendation have also been used,such as Bayesian classifiers[70],[77]and various machine learning techniques,including clustering,decision trees, and artificial neural networks[77].These techniques differ from information retrieval-based approaches in that they calculate utility predictions based not on a heuristic formula,such as a cosine similarity measure,but rather are based on a model learned from the underlying data using statistical learning and machine learning techni-ques.For example,based on a set of Web pages that were rated as“relevant”or“irrelevant”by the user,[77]uses the naive Bayesian classifier[31]to classify unrated Web pages.More specifically,the naive Bayesian classifier is used to estimate the following probability that page p j belongs to a certain class C i(e.g.,relevant or irrelevant) given the set of keywords k1;j;...;k n;j on that page:PðC i j k1;j&...&k n;jÞ:ð7ÞMoreover,[77]uses the assumption that keywords are independent and,therefore,the above probability is proportional toPðC iÞYxPðk x;j j C iÞ:ð8ÞWhile the keyword independence assumption does not necessarily apply in many applications,experimental results demonstrate that naı¨ve Bayesian classifiers still produce high classification accuracy[77].Furthermore,both Pðk x;j j C iÞand PðC iÞcan be estimated from the underlying training data.Therefore,for each page p j,the probability PðC i j k1;j&...&k n;jÞis computed for each class C i and page p j is assigned to class C i having the highest probability[77].While not explicitly dealing with providing recommen-dations,the text retrieval community has contributed several techniques that are being used in content-based recommen-der systems.One example of such a technique would be the research on adaptive filtering[101],[112],which focuses on becoming more accurate at identifying relevant documents incrementally by observing the documents one-by-one in a continuous document stream.Another example would be the work on threshold setting[84],[111],which focuses on determining the extent to which documents should match a given query in order to be relevant to the user.Other text retrieval methods are described in[50]and can also be found in the proceedings of the Text Retrieval Conference (TREC)().As was observed in[8],[97],content-based recommender systems have several limitations that are described in the rest of this section.2.1.1Limited Content AnalysisContent-based techniques are limited by the features that are explicitly associated with the objects that these systems recommend.Therefore,in order to have a sufficient set of features,the content must either be in a form that can be parsed automatically by a computer(e.g.,text)or the features should be assigned to items manually.While information retrieval techniques work well in extracting features from text documents,some other domains have an inherent problem with automatic feature extraction.For example,automatic feature extraction methods are much harder to apply to multimedia data,e.g.,graphical images, audio streams,and video streams.Moreover,it is often not practical to assign attributes by hand due to limitations of resources[97].Another problem with limited content analysis is that,if two different items are represented by the same set of features,they are indistinguishable.Therefore,since text-based documents are usually represented by their most important keywords,content-based systems cannot distin-guish between a well-written article and a badly written one,if they happen to use the same terms[97].2.1.2OverspecializationWhen the system can only recommend items that score highly against a user’s profile,the user is limited to being recommended items that are similar to those already rated. For example,a person with no experience with Greek cuisine would never receive a recommendation for even the greatest Greek restaurant in town.This problem,which has also been studied in other domains,is often addressed by introducing some randomness.For example,the use of genetic algorithms has been proposed as a possible solution in the context of information filtering[98].In addition,the problem with overspecialization is not only that the content-based systems cannot recommend items that are different from anything the user has seen before.In certain cases,items should not be recommended if they are too similar to something the user has already seen,such as a different news article describing the same event.Therefore, some content-based recommender systems,such as Daily-Learner[13],filter out items not only if they are too different from the user’s preferences,but also if they are too similar to something the user has seen before.Furthermore,Zhang et al.[112]provide a set of five redundancy measures to evaluate whether a document that is deemed to be relevant contains some novel information as well.In summary,the diversity of recommendations is often a desirable feature in recommender systems.Ideally,the user should be pre-sented with a range of options and not with a homogeneous set of alternatives.For example,it is not necessarily a good idea to recommend all movies by Woody Allen to a user who liked one of them.2.1.3New User ProblemThe user has to rate a sufficient number of items before a content-based recommender system can really understand the user’s preferences and present the user with reliable recommendations.Therefore,a new user,having very few ratings,would not be able to get accurate recommendations.2.2Collaborative MethodsUnlike content-based recommendation methods,collabora-tive recommender systems(or collaborative filtering systems) try to predict the utility of items for a particular user based on the items previously rated by other users.More formally, the utility uðc;sÞof item s for user c is estimated based on the utilities uðc j;sÞassigned to item s by those users c j2C who are“similar”to user c.For example,in a movierecommendation application,in order to recommend movies to user c ,the collaborative recommender system tries to find the “peers”of user c ,i.e.,other users that have similar tastes in movies (rate the same movies similarly).Then,only the movies that are most liked by the “peers”of user c would be recommended.There have been many collaborative systems developed in the academia and the industry.It can be argued that the Grundy system [87]was the first recommender system,which proposed using stereotypes as a mechanism for building models of users based on a limited amount of information on each individual ing stereotypes,the Grundy system would build individual user models and use them to recommend relevant books to each ter on,the Tapestry system relied on each user to identify like-minded users manually [38].GroupLens [53],[86],Video Recommender [45],and Ringo [97]were the first systems to use collaborative filtering algorithms to automate prediction.Other examples of collaborative recommender systems include the book recommendation system from ,the PHOAKS system that helps people find relevant information on the WWW [103],and the Jester system that recommends jokes [39].According to [15],algorithms for collaborative recom-mendations can be grouped into two general classes:memory-based (or heuristic-based )and model-based .Memory-based algorithms [15],[27],[72],[86],[97]essentially are heuristics that make rating predictions based on the entire collection of previously rated items by the users.That is,the value of the unknown rating r c;s for user c and item s is usually computed as an aggregate of the ratings of some other (usually,the N most similar)users for the same item s :r c;s ¼aggr c 02^Cr c 0;s ;ð9Þwhere ^Cdenotes the set of N users that are the most similar to user c and who have rated item s (N can range anywhere from 1to the number of all users).Some examples of the aggregation function are:ða Þr c;s ¼1N Xc 02^C r c 0;s ;ðb Þr c;s¼k X c 02^Csim ðc;c 0ÞÂr c 0;s ;ðc Þr c;s ¼"rc þk Xc 02^Csim ðc;c 0ÞÂðr c 0;s À"rc 0Þ;ð10Þwhere multiplier k serves as a normalizing factor and is usually selected as k ¼1 P c 02^Cj sim ðc;c 0Þj ,and where the average rating of user c ,"rc ,in (10c)is defined as 1"r c ¼À1 j S c j ÁX s 2S cr c;s;where S c ¼f s 2S j r c;s ¼ g :ð11ÞIn the simplest case,the aggregation can be a simple average,as defined by (10a).However,the most common aggregation approach is to use the weighted sum,shown in (10b).The similarity measure between users c and c 0,sim ðc;c 0Þ,is essentially a distance measure and is used as aweight,i.e.,the more similar users c and c 0are,the more weight rating r c 0;s will carry in the prediction of r c;s .Note that sim ðx;y Þis a heuristic artifact that is introduced in order to be able to differentiate between levels of user similarity (i.e.,to be able to find a set of “closest peers”or “nearest neighbors”for each user)and,at the same time,simplify the rating estimation procedure.As shown in (10b),different recommendation applications can use their own user similarity measure as long as the calculations are normalized using the normalizing factor k ,as shown above.The two most commonly used similarity measures will be described below.One problem with using the weighted sum,as in (10b),is that it does not take into account the fact that different users may use the rating scale differently.The adjusted weighted sum,shown in (10c),has been widely used to address this limitation.In this approach,instead of using the absolute values of ratings,the weighted sum uses their deviations from the average rating of the correspond-ing user.Another way to overcome the differing uses of the rating scale is to deploy preference-based filtering [22],[35],[51],[52],which focuses on predicting the relative prefer-ences of users instead of absolute rating values,as was pointed out earlier in Section 2.Various approaches have been used to compute the similarity sim ðc;c 0Þbetween users in collaborative recom-mender systems.In most of these approaches,the similarity between two users is based on their ratings of items that both users have rated.The two most popular approaches are correlation and cosine-based .To present them,let S xy be the set of all items corated by both users x and y ,i.e.,S xy ¼f s 2S j r x;s ¼ &r y;s ¼ g .In collaborative recom-mender systems,S xy is used mainly as an intermediate result for calculating the “nearest neighbors”of user x and is often computed in a straightforward manner,i.e.,by computing the intersection of sets S x and S y .However,some methods,such as the graph-theoretic approach to collaborative filtering [4],can determine the nearest neighbors of x without computing S xy for all users y .In the correlation-based approach,the Pearson correlation coefficient is used to measure the similarity [86],[97]:sim ðx;y Þ¼Ps 2S xyðr x;s À"rx Þðr y;s À"r y ÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiP s 2S xyðr x;s À"r x Þ2P s 2S xyðr y;s À"ry Þ2r :ð12ÞIn the cosine-based approach [15],[91],the two users x and y are treated as two vectors in m -dimensional space,where m ¼j S xy j .Then,the similarity between two vectors can be measured by computing the cosine of the angle between them:sim ðx;y Þ¼cos ð~x ;~y Þ¼~x Á~yjj ~x jj 2Âjj ~y jj 2¼Ps 2S xy r x;s r y;s ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiP s 2S xyr 2x;sr ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiP s 2S xyr2y;sr ;ð13Þwhere ~x Á~y denotes the dot-product between the vectors ~xand ~y .Still another approach to measuring similarity between users uses the mean squared difference measure1.We use the r c;s ¼ notation to indicate that item s has not been rated by user c .。
Content-based Recommendation Systems
10Content-based Recommendation SystemsMichael J. Pazzani1and Daniel Billsus21 Rutgers University, ASBIII, 3 Rutgers PlazaNew Brunswick, NJ 08901pazzani@2 FX Palo Alto Laboratory, Inc., 3400 Hillview Ave, Bldg. 4Palo Alto, CA 94304billsus@Abstract. This chapter discusses content-based recommendation systems, i.e.,systems that recommend an item to a user based upon a description of the itemand a profile of the user’s interests. Content-based recommendation systemsmay be used in a variety of domains ranging from recommending web pages,news articles, restaurants, television programs, and items for sale. Although thedetails of various systems differ, content-based recommendation systems sharein common a means for describing the items that may be recommended, ameans for creating a profile of the user that describes the types of items the userlikes, and a means of comparing items to the user profile to determine what torecommend. The profile is often created and updated automatically in responseto feedback on the desirability of items that have been presented to the user.10.1 IntroductionA common scenario for modern recommendation systems is a Web application with which a user interacts. Typically, a system presents a summary list of items to a user, and the user selects among the items to receive more details on an item or to interact with the item in some way. For example, online news sites present web pages with headlines (and occasionally story summaries) and allow the user to select a headline to read a story. E-commerce sites often present a page with a list of individual products and then allow the user to see more details about a selected product and purchase the product. Although the web server transmits HTML and the user sees a web page, the web server typically has a database of items and dynamically constructs web pages with a list of items. Because there are often many more items available in a database than would easily fit on a web page, it is necessary to select a subset of items to display to the user or to determine an order in which to display the items.Content-based recommendation systems analyze item descriptions to identify items that are of particular interest to the user. Because the details of recommendation systems differ based on the representation of items, this chapter first discusses alternative item representations. Next, recommendation algorithms suited for each representation are discussed. The chapter concludes with a discussion of variants of the approaches, the strengths and weaknesses of content-based recommendation systems, and directions for future research and development.10.1.1Item RepresentationItems that can be recommended to the user are often stored in a database table. Table 10.1 shows a simple database with records (i.e., “rows”) that describe three restaurants. The column names (e.g., Cuisine or Service) are properties of restaurants. These properties are also called “attributes,” “characteristics,” “fields,” or “variables” in different publications. Each record contains a value for each attribute. A unique identifier, ID in Table 10.1, allows items with the same name to be distinguished and serves as a key to retrieve the other attributes of the record.Table 10.1. A restaurant databaseID Name Cuisine Service CostLow 10001 Mike’s Pizza Italian Counter10002 Chris’s Cafe French Table MediumBistro French Table High Jacques10003The database depicted in Table 10.1 could be used to drive a web site that lists and recommends restaurants. This is an example of structured data in which there is a small number of attributes, each item is described by the same set of attributes, and there is a known set of values that the attributes may have. In this case, many machine learning algorithms may be used to learn a user profile, or a menu interface can easily be created to allow a user to create a profile. The next section of this chapter discusses several approaches to creating a user profile from structured data.Of course, a web page typically has more information than is shown in Table 10.1, such as a text description of the restaurant, a restaurant review, or even a menu. These may easily be stored as additional fields in the database and a web page can be created with templates to display the text fields (as well as the structured data). However, free text data creates a number of complications when learning a user profile. For example, a profile might indicate that there is an 80% probability that a particular user would like a French restaurant. This might be added to the profile because a user gave a positive review of four out of five French restaurants. However, unrestricted text fields are typically unique and there would be no opportunity to provide feedback on five restaurants described as “A charming café with attentive staff overlooking the river.”An extreme example of unstructured data may occur in news articles. Table 10.2 shows an example of a part of a news article. The entire article can be treated as a large unrestricted text field.Table 10.2. Part of a newspaper articleLawmakers Fine-Tuning Energy PlanSACRAMENTO, Calif. -- With California's energy reserves remaining all but depleted, lawmakers prepared to work through the weekend fine-tuning a plan Gov.Gray Davis says will put the state in the power business for "a long time to come."The proposal involves partially taking over California's two largest utilities and signing long-term contracts of up to 10 years to buy electricity from wholesalers. Unrestricted texts such as news articles are examples of unstructured data. Unlike structured data, there are no attribute names with well-defined values. Furthermore, the full complexity of natural language may be present in the text field including polysemous words (the same word may have several meanings) and synonyms (different words may have the same meaning). For example, in the article in Table 10.2, “Gray” is a name rather than a color, and “power” and “electricity” refer to the same underlying concept.Many domains are best represented by semi-structured data in which there are some attributes with a set of restricted values and some free-text fields. A common approach to dealing with free text fields is to convert the free text to a structured representation. For example, each word may be viewed as an attribute, with a Boolean value indicating whether the word is in the article or with an integer value indicating the number of times the word appears in the article.Many personalization systems that deal with unrestricted text use a technique to create a structured representation that originated with text search systems [34]. In this formalism, rather than using words, the root forms of words are typically created through a process called stemming [30]. The goal of stemming is to create a term that reflects the common meaning behind words such as “compute,” “computation,” “computer” “computes” and “computers.” The value of a variable associated with a term is a real number that represents the importance or relevance. This value is called the tf*idf weight (term-frequency times inverse document frequency). The tf*idf weight, w(t,d), of a term t in a document d is a function of the frequency of t in the document (tf t,d), the number of documents that contain the term (df t) and the number of documents in the collection (N).11 Note that in the description of tf*idf weights, the word “document” is traditionally used since the original motivation was to retrieve documents. While the chapter will stick with the original terminology, in a recommendation system, the documents correspond to a text description of an item to be recommended. Note that the equations here are representative of the class of formulae called tf*idf. In general, tf*idf systems have weights that increase monotonically with term frequency and decrease monotonically with document frequency.22,,log )(log ),(∑⎟⎟⎠⎞⎜⎜⎝⎛⎟⎟⎠⎞⎜⎜⎝⎛=i t d t td t ii df N tf df N tf d t w (10.1)Table 10.3 shows the tf*idf representation (also called the vector space representation) of the complete article excerpted in Table 10.2. The terms are ordered by the tf*idf weight. The intuition behind the weight is that the terms with the highest weight occur more often in that document than in the other documents, and therefore are more central to the topic of the document. Note that terms such as “util” (a stem of “utility”), “power,” “megawatt,” are among the highest weighted terms capturing the meaning.Table 10.3. tf*idf representation of the article in Table 10.2util-0.339 power-0.329 megawatt-0.309 electr-0.217 energi-0.206 california-0.181 debt-0.128 lawmak-0.128 state-0.122 wholesal-0.119 partial-0.106 consum-0.105 alert-0.103 scroung-0.096 advoc-0.09 testi-0.088 bail-out-0.088 crisi-0.085 amid-0.084 price-0.083 long-0.082 bond-0.081 plan-0.081 term-0.08 grid-0.078 reserv-0.077 blackout-0.076 bid-0.076 market-0.074 fine-0.073 deregul-0.07 spiral-0.068 deplet-0.068 liar-0.066.Of course, this representation does not capture the context in which a word is used. It loses the relationships between words in the description. For example, a description of a steak house might contain the sentence, “there is nothing on the menu that a vegetarian would like” while the description of a vegetarian restaurant might mention “vegan” rather than vegetarian. In a manually created structured database, the cuisine attribute having a value of “vegetarian” would indicate that the restaurant is indeed a vegetarian one. In contrast, when converting an unstructured text description to structured data, the presence of the word vegetarian does not always indicate that a restaurant is vegetarian and the absence of the word vegetarian does not always indicate that the restaurant is not a vegetarian restaurant. As a consequence, techniques for creating user profiles that deal with structured data need to differ somewhat from those techniques that deal with unstructured data or unstructured data automatically and imprecisely converted to structured data.One variant on using words as terms is to use sets of contiguous words as terms. For example, in the article in Table 10.2, terms such as “energy reserves” and “power business” might be more descriptive of the content than these words treated as individual terms. Of course, terms such as “all but” would also be included, but one would expect that these have very low weights, in the same way that “all” and “but” individually have low weights and are not among the most important terms in Table 10.3.10.2User ProfilesA profile of the user’s interests is used by most recommendation systems. This profile may consist of a number of different types of information. Here, we concentrate on two types of information:1.A model of the user’s preferences, i.e., a description of the types of items thatinterest the user. There are many possible alternative representations of this description, but one common representation is a function that for any item predicts the likelihood that the user is interested in that item. For efficiency purposes, this function may be used to retrieve the n items most likely to be of interest to the user.2.A history of the user’s interactions with the recommendation system. This mayinclude storing the items that a user has viewed together with other information about the user’s interaction, (e.g., whether the user has purchased the item or a rating that the user has given the item). Other types of history include saving queries typed by the user (e.g., that a user searched for an Italian restaurant in the 90210 zip code).There are several uses of the history of user interactions. First, the system can simply display recently visited items to facilitate the user returning to these items. Second, the system can filter out from a recommendation system an item that the user has already purchased or read.2 Another important use of the history in content-based recommendation systems is to serve as training data for a machine learning algorithm that creates a user model. The next section will discuss several different approaches to learning a user model. Here, we briefly describe approaches of manually providing the information used by recommendation systems: user customization and rule-based recommendation systems.In user customization, a recommendation system provides an interface that allows users to construct a representation of their own interests. Often check boxes are used to allow a user to select from the known values of attributes, e.g., the cuisine of restaurants, the names of favorite sports teams, the favorite sections of a news site, or the genre of favorite movies. In other cases, a form allows a user to type words that occur in the free text descriptions of items, e.g., the name of a musician or author that interests the user. Once the user has entered this information, a simple database matching process is used to find items that meet the specified criteria and display them to the user.There are several limitations of user customization systems. First, they require effort from the user and it is difficult to get many users to make this effort. This is particularly true when the user’s interests change, e.g., a user may not follow football 2 Of course, in some situations it is appropriate to recommend an item the user has purchased and in other situations it is not. For example, a system should continue to recommend an item that wears out or is expended, such as a razor blade or print cartridge, while there is little value in recommending a CD or DVD a user owns.during the season but then become interested in the Superbowl. Second, customization systems do not provide a way to determine the order in which to present items and can find either too few or too many matching items to display.Figure 10.1 shows book recommendations at . Although is usually thought of as a good example of collaborative recommendation (see Chapter 9 of this book [35]), parts of the user’s profile can be viewed as a content-based profile. For example, Amazon contains a feature called “favorites” that represents the categories of items preferred by users. These favorites are either calculated by keeping track of the categories of items purchased by users or may be set manually by the user. Figure 10.2 shows an example of a user customization interface in which a user can select the categories.In rule-based recommendation systems, the recommendation system has rules to recommend other products based on the user history. For example, a system may contain a rule that recommends the sequel to a book or movie to people who have purchased the early item in the series. Another rule might recommend a new CD by an artist to users that purchased earlier CDs by that artist. Rule-based systems may capture several common reasons for making recommendations, but they do not offer the same detailed personalized recommendations that are available with other recommendation systems.Fig. 10.1. Book recommendations by .Fig. 10.2. User customization in 10.3Learning a User ModelCreating a model of the user’s preference from the user history is a form of classification learning. The training data of a classification learner is divided into categories, e.g., the binary categories “items the user likes” and “items the user doesn’t like.” This is accomplished either through explicit feedback in which the user rates items via some interface for collecting feedback or implicitly by observing the user’s interactions with items. For example, if a user purchases an item, that is a sign that the user likes the item, while if the user purchases and returns the item that is a sign that the user doesn’t like the item. In general, there is a tradeoff since implicit methods can collect a large amount of data with some uncertainty as to whether the user actually likes the item. In contrast, when the user explicitly rates items, there is little or no noise in the training data, but users tend to provide explicit feedback on only a small percentage of the items they interact with.Figure 10.3 shows an example of a recommendation system with explicit user feedback. The recommender “MyBestBets” by ChoiceStream is a web based inter-face to a television recommendation system. Users can click on the thumbs up or thumbs down buttons to indicate whether they like the program that is recommended. By necessity, this system requires explicit feedback because it is not integrated with atelevision [1] and cannot infer the user’s interests by observing the user’s behavior.Fig. 10.3. A recommendation system using explicit feedbackThe next section reviews a number of classification learning algorithms. Such algorithms are the key component of content-based recommendation systems, because they learn a function that models each user’s interests. Given a new item and the user model, the function predicts whether the user would be interested in the item. Many of the classification learning algorithms create a function that will provide an estimate of the probability that a user will like an unseen item. This probability may be used to sort a list of recommendations. Alternatively, an algorithm may create a function that directly predicts a numeric value such as the degree of interest.Some of the algorithms below are traditional machine learning algorithms de-signed to work on structured data. When they operate on free text, the free text is first converted to structured data by selecting a small subset of the terms as attributes. In contrast, other algorithms are designed to work in high dimensional spaces and do not require a preprocessing step of feature selection.10.4Decision Trees and Rule InductionDecision tree learners such as ID3 [31] build a decision tree by recursively partitioning training data, in this case text documents, into subgroups until thosesubgroups contain only instances of a single class. A partition is formed by a test onsome feature -- in the context of text classification typically the presence or absence of an individual word or phrase. Expected information gain is a commonly used criterion to select the most informative features for the partition tests [38].Decision trees have been studied extensively in use with structured data such as that shown in Table 10.1. Given feedback on the restaurants, a decision tree can easily represent and learn a profile of someone who prefers to eat in expensive French restaurants or inexpensive Mexican restaurants. Arguably, the decision tree bias is not ideal for unstructured text classification tasks [29]. As a consequence of the information-theoretic splitting criteria used by decision tree learners, the inductive bias of decision trees is a preference for small trees with few tests. However, it can be shown experimentally that text classification tasks frequently involve a large number of relevant features [17]. Therefore, a decision tree’s tendency to base classifications on as few tests as possible can lead to poor performance on text classification. However, when there are a small number of structured attributes, the performance, simplicity and understandability of decision trees for content-based models are all advantages. Kim et al. [18] describe an application of decision trees for personalizing advertisements on web pages.RIPPER [9] is a rule induction algorithm closely related to decision trees that operates in a similar fashion to the recursive data partitioning approach described above. Despite the problematic inductive bias, however, RIPPER performs competitively with other state-of-the-art text classification algorithms. In part, the performance can be attributed to a sophisticated post-pruning algorithm that optimizes the fit of the induced rule set with respect to the training data as a whole. Furthermore, RIPPER supports multi-valued attributes, which leads to a natural representation for text classification tasks, i.e., the individual words of a text document can be represented as multiple feature values for a single feature. While this is essentially a representational convenience if rules are to be learned from unstructured text documents, the approach can lead to more powerful classifiers for semi-structured text documents. For example, the text contained in separate fields of an email message, such as sender, subject, and body text, can be represented as separate multi-valued features, which allows the algorithm to take advantage of the document’s structure in a natural fashion. Cohen [10] shows how RIPPER can classify e-mail messages into user defined categories.10.5Nearest Neighbor MethodsThe nearest neighbor algorithm simply stores all of its training data, here textual descriptions of implicitly or explicitly labeled items, in memory. In order to classify a new, unlabeled item, the algorithm compares it to all stored items using a similarity function and determines the "nearest neighbor" or the k nearest neighbors. The class label or numeric score for a previously unseen item can then be derived from the class labels of the nearest neighbors.The similarity function used by the nearest neighbor algorithm depends on the type of data. For structured data, a Euclidean distance metric is often used. When using the vector space model, the cosine similarity measure is often used [34]. In the Euclideandistance function, the same feature having a small value in two examples is treated the same as that feature having a large value in both examples. In contrast, the cosine similarity function will not have a large value if corresponding features of two examples have small values. As a consequence, it is appropriate for text when we want two documents to be similar when they are about the same topic, but not when they are both not about a topic.The vector space approach and the cosine similarity function have been applied to several text classification applications ([11], [39], [2]) and, despite the algorithm’s unquestionable simplicity, it performs competitively with more complex algorithms. The Daily Learner system uses the nearest neighbor algorithm to create a model of the user’s short term interests [7]. Gixo, a personalized news system, also uses text similarity as a basis for recommendation (Figure 10.4). The headlines are preceded by an icon that indicates how popular the item is (the first bar) and how similar the story is to stories that have been read by the user before (the second bar). The fact that these bars differ shows the value of personalizing to the individual.Fig. 10.4. Gixo presents personalized news based on similarity to articles that have previously been read10.6Relevance Feedback and Rocchio’s AlgorithmSince the success of document retrieval in the vector space model depends on the user’s ability to construct queries by selecting a set of representative keywords [34], methods that help users to incrementally refine queries based on previous search results have been the focus of much research. These methods are commonly referred to as relevance feedback. The general principle is to allow users to rate documents returned by the retrieval system with respect to their information need. This form of feedback can subsequently be used to incrementally refine the initial query. In amanner analogous to rating items, there are explicit and implicit means of collecting relevance feedback data.Rocchio’s algorithm [33] is a widely used relevance feedback algorithm that operates in the vector space model. The algorithm is based on the modification of an initial query through differently weighted prototypes of relevant and non-relevant documents. The approach forms two document prototypes by taking the vector sum over all relevant and non-relevant documents. The following formula summarizes the algorithm formally:∑∑−+=+nonrel ii rel i i i i D D D D Q Q γβα1 (10.2)ere, Q is the user’s query at iteration i , and α, β, and γ are parameters that control ork, researchers have used a variation of Rocchio’s algorithm in a ma 10.7 Linear ClassifiersAlgorithms that learn linear decision boundaries, i.e., hyperplanes separating in-H i the influence of the original query and the two prototypes on the resulting modified query. The underlying intuition of the above formula is to incrementally move the query vector towards clusters of relevant documents and away from irrelevant documents. While this goal forms an intuitive justification for Rocchio’s algorithm, there is no theoretically motivated basis for the above formula, i.e., neither performance nor convergence can be guaranteed. However, empirical experiments have demonstrated that the approach leads to significant improvements in retrieval performance [33].In more recent w chine learning context, i.e., for learning a user profile from unstructured text ([15],[3], [29]). The goal in these applications is to automatically induce a text classifier that can distinguish between classes of documents. In this context, it is assumed that no initial query exists, and the algorithm forms prototypes for classes analogously to Rocchio’s approach as vector sums over documents belonging to the same class. The result of the algorithm is a set of weight vectors, whose proximity to unlabeled documents can be used to assign class membership. Similar to the relevance feedback version of Rocchio’s algorithm, the Rocchio-based classification approach does not have any theoretic underpinnings and there are no performance or convergence guarantees.stances in a multi-dimensional space, are referred to as linear classifiers. There are a large number of algorithms that fall into this category, and many of them have been successfully applied to text classification tasks [20]. All linear classifiers can be described in a common representational framework. In general, the outcome of the learning process is an n -dimensional weight vector w , whose dot product with an n -dimensional instance, e.g., a text document represented in the vector space model, results in a numeric score prediction. Retaining the numeric prediction leads to a linear regression approach. However, a threshold can be used to convert continuouspredictions to discrete class labels. While this general framework holds for all linear classifiers, the algorithms differ in the training methods used to derive the weight vector w . For example, the equation below is known as the Widrow-Hoff rule, delta rule or gradient descent rule and derives the weight vector w by incremental vector movements in the direction of the negative gradient of the example's squared error[37]. This is the direction in which the error falls most rapidly.j i i i i j i j i x y x w w w ,,,1)(2−⋅−=+η (10.3)The equation shows how the weight vector w can be derived incrementally. The inn s experimentally been shown to outperform the ap f the above learning schemes for linear algorithms is that the approaches tend to converge on hy er product of instance x i and weight vector w i is the algorithm’s numeric prediction for instance x i . The prediction error is determined by subtracting the instance’s known score, y i , from the predicted score. The resulting error is then multiplied by the original instance vector x i and the learning rate η to form a vector that, when subtracted from the weight vector w, moves w towards the correct prediction for instance x i. The learning rate η controls the degree to which every additional instance affects the previous weight vector.An alternative algorithm that ha proach above on text classification tasks with many features is the exponentiated gradient (EG) algorithm. Kivinen and Warmuth [19] prove a bound for EG’s error, which depends only logarithmically on the number of features. This result offers a theoretic argument for EG’s performance on text classification problems, which are typically high-dimensional.An important advantage o y can be performed on-line, i.e., the current weight vector can be modified incrementally as new instances become available. This is a crucial advantage for applications that operate under real-time constraints.Finally, it is important to note that while the above perplanes that separate the training data accurately, the hyperplane’s generalization performance might not be optimal. A related approach aimed at improving generalization performance is known as support vector machines [36]. The central idea underlying support vector machines is to maximize the classification margin, i.e., the distance between the decision boundary and the closest training instances, the so-called support vectors. A series of empirical experiments on a variety of benchmark data sets indicated that linear support vector machines perform particularly well on text classification tasks [17]. The main reason for this is that the margin maximization is an inherently built-in overfitting protection mechanism. A reduced tendency to overfit training data is particularly useful for text classification algorithms, because in this domain high dimensional concepts must often be learned from limited training data, which is a scenario prone to overfitting.。
计算机科学与技术毕业论文参考文献示例
计算机科学与技术毕业论文参考文献示例参考文献[1]. Abdellatif, T. and F. Boyer. A node allocation system for deploying JavaEE systems on Grids. 2009. Hammemet, Tunisia.[2]. Bharti, A.K. and S.K. Dwivedi, E-Governance in Public Transportation: U.P.S.R.T.C. ——CAase Study. 2011: Kathmandu, Nepal. p. 7-12.[3]. ChangChun, S.Z.C.S., et al., A Novel Two -stage Algorithm of Fuzzy C-Means Clustering. 2010: 中国吉林长春. p. 85-88.[4]. Changchun, Z.Z.H.Q., Simulation of 3 -C Seismic Records In 2 -D TIM. 1991: 中国北京. p. 489-493.[5]. CHINA, G.C.O.M., The trust model based on consumer recommendation in B-C e-commerce. 2011: 中国湖北武汉. p. 214-217.[6]. ENGINEERING, W.C.H.X., H.T.S.H. PROPAGATION and XINXIANG, A C BAND SYSTEM FOR IONOSPHERIC SCINTILLATION OBSERVATION. 1991: 中国北京. p. 470-476.[7]. Henriksson, K., K. Nordlund and J. Wallenius, Simulating model steels:An analytical bond-order potential for Fe-C. 2008: 中国北京. p. 138.[8]. Jiansen, Y., et al., Suspension K&C Characteristics and the Effect on Vehicle Steering. 2010: 中国吉林长春. p. 408-411.[9]. Jilin, W.G.D.O., C.W.S.D. Changchun and China, Realization and Optimization of Video Encoder Based on TMS320C6455 DSPs. 2010: 中国吉林长春. p. 312-317.[10]. Juan, C., et al., Semi-physical simulation of an optoelectronic tracking servo system based on C MEX S functions. 2010: 中国吉林长春. p. 46-49.[11]. Kachru, S. and E.F. Gehringer. A comparison of j2ee and. net as platforms for teaching web services. 2004.[12]. KIM, T., et al., MRI Image Segmentation Using Intuitive Fuzzy C-Means Algorithm. 2011: 中国湖北武汉. p. 306-309.[13]. Li, M. and H. Wang. A device management system based on JAVAEE Web. 2009. Wuhan, China.[14]. Li, Z. and Z. Weixi. Design of tourism e -business system based on JavaEE multi-pattern. 2012. Sanya, China.[15]. Lin, P., H. Wen and S. Zhou. Design and implementation of job-search system based on javaEE. 2010. Hong Kong, China.[16]. Lv, X., Y . Qin and J.N.G. University, Film growth by polyatomic C2H5+ bombarding amorphous carbon surfaces:molecular dynamics study. 2008: 中国北京. p. 148.[17]. Meyer, B.. NET is coming [Microsoft Web services platform]. Computer, 2001. 34(8): p. 92--97.[18]. Meyer, B., R. Simon and E. Stapf, Instant .NET. Recherche, 2003. 67: p. 02.[19]. Morishita, K., et al., Atomistic modeling of formation kinetics of vacancy clusters in 3C-SiC during irradiation. 2008: 中国北京. p. 141.[20]. Ou, J., et al. Design and research on teaching platform of stage task using JavaEE. 2012. Chongqing, China.[21]. Science,J.X.Z.M., et al., A New Method to get Essential Efficient solution for A Class of D.C.Multiobjective Problem. 2010: 中国吉林长春. p. 501-503.[22]. XIONG, J., L. YAO and J. HU, Implementation of Dynamically Generating HTML WebPages by C\#. Computer and Modernization, 2007. 10.[23]. Yeh, Y. and H. Lin, Cardiac Arrhythmia Diagnosis Method Using Fuzzy C-Means Algorithm on ECG Signals. 2010: 中国台湾台南. p. 272-275.[24]. Yizheng, T., et al. Design and implementation of USB key-based JavaEE dual-factor authentication system. 2009. Xi'an, China.[25]. Zhan, T., et al., Brain MR Image Segmentation and Bias Field Correction Using Adaptive Fuzzy C Means Model. 2010: 中国吉林长春. p. 151-154.[26]. Zhang, L., et al., First -principles investigation of site occupancy of neutral H in 3C - SiC. 2008: 中国北京. p. 86.[27]. Zhang, L. and W. Zhang. Implement of e -governmentsystem with data persistence of JavaEE. 2010. Hong Kong, Hong kong.[28]. 杨盛泉等, 基于MVC 模式的耐火材料梭式窑分布式控制系统. 计算机测量与控制, 2010(6): 第1326-1328+1331 页.[29]. 杨晓强与李海军, 在通用航空安全信息管理系统中的应用. 现代计算机(专业版), 2011(15): 第74-76 页.[30]. 杨勇与韩莉英, 基于MVC 模式的Struts 框架在电子商务系统中的应用. 计算机应用研究, 2006(5): 第172-174 页.[31]. 姚立, 汪峥与穆华灵, 基于知识制造系统优化及 下实现. 计算机技术与发展, 2011(11): 第1-3+7 页.[32].叶晓菡,基于.NET的网络用语在线词典软件的设计与实现.计算机时代, 2010(9): 第27-29 页.[33].易威环,基于Session Facade的MVC模式设计.电脑知识与技术, 2010(15): 第3984-3985+3990 页.[34]. 殷永峰, 王轶辰与刘斌, 基于MVC 模式的嵌入式软件测试开发环境设计. 计算机工程与应用, 2007(7): 第117-119 页.[35]. 游琪, 张广云与桂改花, 基于MVC 模式的角色访问控制系统设计. 电脑知识与技术, 2009(32): 第8939-8940 页.[36].于同亚,用C#设计基于.NET框架的应用程序一一购物网站的设计与实现. 电脑知识与技术, 2009(18): 第4907-4908页.[37]. 袁江琛, 基于 的校园信息网设计和开发. 电脑编程技巧与维护, 2011(24): 第23-24+49 页.[38]. 占小忆, 中利用 连接数据库. 电脑知识与技术(学术交流), 2007(5): 第1211-1212 页.[39].张峰与张莉莉,平台连接池机制的分析与设计.电脑学习, 2008(2): 第89-90页.[40].张国武,基于OPC和.NET框架的SIMATICNET客户应用实现. 工业控制计算机, 2008(4): 第70-71 页.[41]. 张捍卫, 基于 AJAX 的资产网络清查系统的设计. 计算机与现代化, 2012(4): 第94-96 页.[42]. 张建成与李春青, 基于.NET 环境下 访问数据库技术的研究. 电脑知识与技术, 2009(22): 第6102-6104 页.[43]. 张杰, 张景安与孙沛, 基于云模型的C2C 电子商务信任评价模型. 计算机系统应用, 2010(11): 第83-87+74 页.[44].张黎明与龚琪琳,基于MVC模式的Java Web应用设计.计算机与现代化, 2007(2): 第22-24 页.[45]. 张俐, MVC 模式在数据中间件中的应用. 计算机工程,2010(9): 第70-72 页.[46].张俐,基于JavaEE的电信CRM数据持久层的实现.计算机工程, 2009(6): 第41-43 页.[47].张俐与张维玺,基于JavaEE的固定资产管理系统的设计与实现. 计算机工程与设计, 2009(16): 第3797-3800 页.[48]. 张南平与朱富利, 基于MVC 模式的Struts 框架的研究与应用.计算机技术与发展, 2006(3): 第229-231+234 页.应用, 2009(19): 第139-141页.[50]. 张翔, 陆远与罗贵明, 模型与实例设计模式在工作流管理系统设计中的应用. 计算机应用研究, 2006(7): 第165-166+169 页.[51]. 张永才与吾守尔?斯拉木, 基于J2ME 的维汉双语电子词典的研究与实现. 计算机系统应用, 2010(7): 第229-231 页.[52]. 张宗平, 马冰冰与莫灵江, 基于 的网络培训系统的研究. 现代计算机(专业版), 2011(14): 第52-54页.[53]. 赵亮与齐欢, 基于MVC 模式的三峡通航数据维护系统的实现. 计算机技术与发展, 2006(7): 第156-158 页.[54].赵鸣,张旭与熊静,基于.NET与WCF的民航订座系统研究.计算机工程与设计, 2012(4): 第1653-1659 页.[55].郑华,基于MVC模式的Tapestry框架研究与应用.微电子学与计算机, 2006(S1): 第38-39+42 页.[56]. 郑晶晶与刘玉宾, 基于 的 对象与数据库的交互. 电脑知识与技术, 2009(2): 第293-295页.[57].郑文等,基于客户端MVC模式的RIA WebGIS框架设计与应用. 计算机应用与软件, 2011(5): 第75-77+93 页.[58]. 郑颖与袁宝国, MVC 模式在中小型连锁超市信息管理系统的应用. 计算机应用与软件, 2006(9): 第134-136页.[59]. 钟金琴与辜丽川, 一种面向对象的软件设计模式库的设计. 计算机技术与发展, 2008(9): 第22-25 页.[60]. 钟灵等, MVC 模式的电梯群控仿真系统. 计算机工程与应用, 2009(32): 第197-199+243 页.[61].周纯杰,陈笛与阎峰,基于.NET的圆网印花机远程监控系统的开发. 计算机应用研究, 2005(6): 第154-156页.[62]. 周迅飞与王崑声, 基于MVC 模式的Rails 框架研究. 计算机仿真, 2006(2): 第270-274 页.[63]. 周杨, AJAX 应用的典型设计模式. 计算机系统应用, 2011(1): 第128-132 页.[64]. 周永平, MVC 模式在软件设计应用中的研究. 信息与电脑(理论版), 2009(11): 第58-59页.[65]. 朱青卫, 基于NET 框架可重用的数据访问组件的实现. 电脑与电信, 2007(1): 第62-64页.[66]. 朱卫新, Visual C#.NET 实现用户自定义图形编程方法. 计算机技术与发展, 2012(4): 第130-132+136 页.[67]. 庄新妍, 基于CDIO 教育的 程序设计课程教学改革初探. 电脑知识与技术, 2011(35): 第9192-9193 页.[68]. 卓有斌, 浅论我国B2C/C2C 电子商务物流的现状及几点对策. 电脑知识与技术, 2010(14): 第3839-3840 页.[69]. 邹利艳等, 基于JavaEE 架构的旅游电子商务平台的设计开发. 电脑知识与技术, 2011(4): 第712-714 页.[70]. 曾琳与蔡晓丽, 基于MVC 模式的网络教学平台. 电脑知识与技术, 2009(6): 第1527-1528页.[71].曾路与张立臣, ―― 于.NET平台的AOP技术.计算机应用研究, 2005(5): 第225-226 页.[72].曾一等,基于J2EE平台的Java构件库的研究和实现.计算机科学, 2006(4): 第274-276+280 页.[73]. 常红伟, 基于.NET 的信息交换平台的设计与实现. 计算机与现代化, 2011(12): 第37-40 页.[74]. 陈春艳, 支持多浏览器读取XML 内容的方法实现. 电脑与电信, 2010(6): 第71-73页.[75].陈烽与陈蓉,基于MVC模式和JavaBean的B2C电子商城框架的实现. 电脑与电信, 2010(1): 第50-52 页.[76].陈刚等,基于Chord的合作浏览器Cache模型.计算机应用与软件, 2006(5): 第93-95 页.[77]. 陈华恩, JAVA 设计模式研究之抽象工厂模式. 电脑知识与技术2010(9): 第2245-2246 页.[78]. 陈乐与杨小虎, MVC 模式在分布式环境下的应用研究. 计算机工程, 2006(19): 第62-64 页.[79]. 陈武, 阳春华与吴同茂, 基于Ajax 技术的MVC 模式在远程控制实验室中的应用. 计算机系统应用, 2009(8): 第175-177 页.[80]. 陈小祥, 洪金益与吴健生, 基于 Link 和.net 的WebGIS的实现.计算机应用与软件,2008(3):第135-137页.[81].陈绪君等.NET框架Web Service和.NET Remoting分布式应用解决方案及评价. 计算机应用研究, 2003(9): 第110-112 页.[82]. 陈学锋等, 分布式实时网络监测系统的设计与实现. 计算机工程, 2002(6): 第139-140+143 页.[83]. 陈雪娟, 基于MVC 模式的SSH 开发技术. 电脑学习,2011(2): 第137-139 页.[84].陈谊楠,基于.NET平台采用实现数据访问层.电脑编程技巧与维护, 2012(4): 第35-36+57 页.[85]. 程郢瑞, 郭福亮与王晶, 基于MVC 模式的人才测评系统的分析与设计. 计算机与数字工程, 2010(1): 第197-200 页.[86]. 崔阳华, +Oracle 环境下安全方案的研究. 智能计算机与应用, 2011(6): 第36-39 页.[87]. 丁民豆, MVC 模式在创建图表组件中的应用与研究. 电脑知识与技术, 2011(24): 第5925-5927 页.[88]. 董袁泉, 基于MVC 模式的Struts 框架研究与应用. 电脑编程技巧与维护, 2010(22): 第25-26+66 页.[89]. 窦亮, 金恩年与黄国兴, 基于MVC 设计模式的电子名片系统的设计与实现. 计算机工程, 2005(21): 第229-231 页.[90]. 杜青, Visual C++.NET 平台下贪吃蛇游戏的实现. 电脑编程技巧与维护, 2011(24): 第36-37 页.[91]. 段春梅, 基于MVC 模式的课程申报系统的设计与实现. 电脑学习, 2009(6): 第27-28页.[92]. 方春平与管建和, 基于多重数组的词典技术研究与实现. 电脑知识与技术, 2009(9): 第2173-2174 页.[93]. 方俊, 基于.NET Remoting 口令系统的设计. 计算机时代2012(4): 第9-11 页.[94].方文骁与张在琛,基于.NET框架的网络视频处理•计算机工程2011(S1): 第359-361 页.[95]. 冯东栋与张钊, 一种扩展的MVC 手机设计模式. 计算机与现代化, 2008(8): 第122-124 页.[96]. 冯铁等, 面向Java 语言的设计模式抽取方法的研究. 计算机工程与应用, 2005(25): 第28-33 页.[97]. 冯晓强与程晓昕, 基于MVC 模式的网上购物系统的设计与实现.现代计算机(专业版), 2009(7): 第177-180 页.[98]. 葛文庚与郭斐斐, 基于MVC 模式的物流管理信息系统的设计与实现. 电脑知识与技术, 2010(22): 第6135-6136 页.[99]. 耿祥义与郝丽, JAVA 设计模式在视频监控系统软件设计中的应用. 电脑知识与技术, 2010(30): 第8490-8492 页.[100]. 卢莉, 基于淘宝网的C2C 电子商务信用评价模型改进研究. 现代计算机(专业版), 2011(28): 第30-32 页.[101]. 卢贤玲等, 基于Java 网上虚拟实验系统设计与实现. 计算机工程与应用, 2004(7): 第158-160 页.[102]. 陆银梅, Web 服务器控件设计. 电脑编程技巧与维护, 2011(22): 第98-99+142 页.[103]. 罗红梅与陆鑫, 基于JSF 和Tiles 的MVC 模式的实现. 计算机与现代化, 2006(2): 第38-41 页.[104]. 马立东, 面向英汉词典编纂的粘贴工具的设计、实现及应用.计算机与现代化, 2010(7): 第145-148 页.[105]. 马庆兵, 基于MVC 模式的Struts 框架研究与应用. 信息与电脑(理论版), 2009(12): 第68-69 页.[106]. 马帅军, 陈洲与陈念, 基于规则引擎和MVC 模式的管理系统设计. 电脑与电信, 2010(2): 第34-36页.[107]. 牛俊慧, 张红光与牛会丽, 基于MVC 模式的电子商务平台构造技术研究. 计算机工程与设计, 2006(23): 第4479-4481 页.[108]. 潘海兰, 吴翠红与葛晓敏, XML 及其在MVC 模式中的应用. 计算机技术与发展, 2010(2): 第202-205 页.[109]. 裴炳镇等, 一种生成机读词典的方法. 计算机工程与应用, 2005(3): 第116-118 页.[110]. 彭明与蒋晓瑜, 基于 技术的网上书城系统分析. 计算机光盘软件与应用, 2012(4): 第176-177 页.[111]. 齐德昱与谢景明, 一个基于Java 虚拟机的分布式计算模型. 计算机科学, 2007(6): 第248-250 页.[112]. 綦宏伟, 代亚非与李晓明, 基于Java/Swing 的通用文件管理器设计模式. 计算机工程与应用, 2003(8): 第108-111 页.[113]. 全金连, 李琴与覃毅, 基于MVC 模式的成人教学管理系统的设计与实现. 电脑知识与技术, 2010(9): 第2180-2181 页.[114]. 任桢, 电子词典的设计研究. 计算机与数字工程, 2003(1): 第62-64+51 页.[115]. 戎小群, 面向对象设计模式的研究与应用. 电脑知识与技术2010(33): 第9437-9439 页.[116]. 沈刚, 下项目管理系统的设计和研究. 计算机光盘软件与应用, 2012(3): 第197+191 页.[117]. 史栋杰与孔华锋, 领域驱动设计中资源库模式的设计与实现. 电脑知识与技术, 2010(33): 第9617-9618+9621 页.[118]. 史学梅, Ajax 技术在EXT 框架与MVC 模式整合中的应用. 电脑知识与技术, 2010(24): 第6779-6780 页.[119]. 孙惠芬, 基于NET 环境下的日报表系统的设计与实现. 电脑知识与技术, 2011(35): 第9143-9144 页.[120].孙金艳,基于JavaEE的移动新闻系统的设计与实现.电脑知识与技术, 2011(32): 第8023-8024 页.[121].孙静,基于以太网控制器LAN91C111的口c/TCP-IP网络接口通信实现. 电脑学习, 2010(6): 第52-54页.[122].孙艳红,利用C++ Builder幵发Web浏览器.电脑学习,2009(1): 第26-27 页.[123].谈娴茹,基于Browser/Server的课件系统的设计与实现.计算机工程, 2005(S1): 第165-166页.[124]. 覃开贤与卢澔, 基于MVC 模式的在线作业系统的设计. 计算机与现代化, 2011(2): 第160-163 页.[125]. 谭建与丁维明, 基于面向对象和设计模式的电厂工作票软件模块的设计. 计算机与现代化, 2005(1): 第113-115 页.[126]. 唐伟, C2C 电子商务信任评价模型设计与实现. 计算机工程与应用, 2012(1): 第94-97页.[127]. 田飞与程慧芳, 基于P2P 网络的浏览器缓存协作系统的研究计算机工程与设计, 2010(22): 第4780-4786 页.[128].王峰,基于Struts和Hibernate的MVC模式在网站架构设计中的应用. 信息与电脑(理论版), 2009(8): 第133+135 页.[129]. 王海蓉与苗放, 基于MVC 模式的STRUTS 框架的研究与应用电脑知识与技术, 2006(26): 第102-103 页.[130]. 王莉, 基于 搜索引擎模型的实现. 计算机与现代化2011(11): 第199-201+205 页.[131]. 王敏, 基于MVC 模式的校友录系统设计与实现. 计算机与数字工程, 2011(2): 第104-107页.[132]. 王翔, 用设计模式和.Net 技术实现对象池设计. 计算机辅助工程, 2007(4): 第68-72 页.[133].王向中,基于MVC模式Web应用框架的研究和幵发.电脑编程技巧与维护, 2009(22): 第85-86 页.[134]. 王晓庆等, 设计模式中的面向对象原则及其子模式. 计算机工程, 2003(9): 第192-194页.[135]. 王艳华与何保锋, 基于MVC 模式的数据库分页显示的设计与实现. 电脑知识与技术(学术交流), 2007(6): 第1502-1503页.[136]. 王云晓与张学诚, 基于.NET 的数据库访问优化策略. 计算机与现代化, 2011(12): 第86-88 页.[137]. 王喆, 基于.NET 的作业处理系统的设计与实现. 计算机应用与软件, 2012(4): 第213-215页.[138]. 文爱平与文德民, 基于IE 浏览器的Ajax Comet 架构. 电脑知识与技术, 2010(17): 第4646-4648 页.[139].文习明,平台下对MVC模式的一个扩展.现代计算机,2006(4): 第23-26 页.[140].文学,轻量级JavaEE的另一种选择:JST.华南金融电脑, 2009(4): 第52-53 页.[141]. 吴国芳与金珊, 基于MVC 模式的医疗事故争议处理系统. 电脑知识与技术, 2009(32): 第8976-8977 页.[142]. 吴海珍, 和ADOX 在 数据库编程中的应用. 电脑与信息技术, 2009(1): 第73-75 页.[143].吴茂昌与阳玉琴,基于MVC模式的Java主流框架整合技术研究. 计算机与数字工程, 2009(10): 第91-93+111 页.[144]. 吴森, 王克峰与谢佳, 在 环境下高效使用SQL 数据提供程序连接池. 计算机与数字工程, 2005(11): 第86-89页.[145].武小稞,基于.NET的出口活牛育肥场电子信息管理系统•计算机光盘软件与应用, 2012(3): 第159+161 页.[146].肖菁,高校非计算机专业C/C++教学的探索与实践.现代计算机(专业版), 2011(30): 第21-22 页.[147]. 肖茂兵与卢振环, JavaEE 应用技术框架选型. 华南金融电脑, 2006(8): 第78-81 页.[148]. 谢波, 申瑞民与王加俊, 一种基于.NET Services 的标注系统框架“ MyAnnotation . N ET ”. 计算机工程, 2003(16): 第179-181 页.[149]. 谢珩等, MVC 模式在Web 应用中的一种实现. 计算机科学, 2006(5): 第136-138 页.[150]. 辛玉华与王哲, 对基于.NET FRAME 架构软件汉化的一次尝试. 电脑知识与技术, 2012(5): 第1184-1186 页.[151]. 熊建芳, 高继与任贺宇, 基于 的ADO 与 分析与研究. 计算机与现代化, 2006(7): 第36-38 页.[152]. 熊建英与钟元生, 一种抗欺诈的C2C 卖方信誉计算模型研究. 计算机科学, 2012(2): 第68-71 页.[153]. 熊勇, MVC 模式下一种高效分页方法的研究与实现. 电脑知识与技术, 2006(29): 第75+78 页.[154]. 徐涤, 基于 的人事管理系统设计与实现. 电脑编程技巧与维护, 2012(8): 第31-32 页.[155]. 徐生菊与王命延, MVC 模式在储粮害虫查询与防治系统中的应用研究. 计算机与现代化, 2006(4): 第112-114 页.[156]. 阎秀英, 周亚建与胡正名, 基于Java 的网络实时远程监控系统设计. 计算机工程, 2009(5): 第74-75+78 页.[157]. 阎英, 刘伯红与席珍, 基于MVC 模式Struts 结构的研究生管理信息系统. 计算机与数字工程, 2007(4): 第39-41+2 页.[158]. 杨岸, 丁汉与熊有伦, 电子词典词库的压缩技术研究与实现. 计算机工程与设计, 2004(3): 第340-343 页.[159]. 杨浮群, 邹利艳与徐丽, JavaEE 开发环境下Sql Serve 数据库优化. 电脑知识与技术, 2011(31): 第7597-7599 页.[160]. 杨刚, 顾宏斌与赵芷晴, 对基于J2EE 的MVC 模式视图部分改进. 计算机技术与发展, 2012(3): 第103-105+109 页.[161].杨厚群与陈静,Java异常处理机制的研究•计算机科学,2007(3): 第286-289+292 页.[162]. 杨洁芳, 基于VC++ 的WEB 浏览器的实现. 电脑学习,2006(1):第30-31 页.[163].杨柳.NET环境下MD5加密技术的研究.计算机安全,2011(12): 第43-46 页.[164]. 杨睿与姚淑珍, 设计模式复用支持系统的设计实现. 计算机工程, 2004(1): 第80-81+87 页.[1 ] Bollela G, Gosling J, Brosgol B, et al. The Real -time Specification for Java[M]. [S. l.]: Addison Wesley, 2000.[2]Connor J M O, Tremblay M. PicoJava -I: The Java Virtual Machine in Hardware[J]. IEEE Micro, 1997, 17(2): 45 -53.[3]McGhan H, Connor J M O. PicoJava: A Direct Execution Engine for Java Bytecode[J]. Computer, 1998, 31(10): 22-30.[4]Brinkschulte U, Krakowsi C, Kreuzinger J, et al. The Komodo Project: Thread-based Event Handling Supported by a Multi - threaded Java Microcontroller[C]//Proceedings of the 25th EUROMICRO Conference. Milano, Italy: [s. n.], 1999.[5]Schoeberl M. JOP: A Java Optimized Processor for Embedded Real-time Systems[EB/OL]. [2010 -08-26]. http://www.jopdesign. com.[6]柴志雷.Java实时性及嵌入式实时Java处理器研究[D].上海:复旦大学, 2006.[7]苏超云,柴志雷,涂时亮.实时Java平台的类预处理器研究[J].计算机工程, 2010, 36(7): 246-248, 251.[8]M L501 User Guide[EB/OL]. (2009-08-24). http://www.xilinx. com/products/boards /ml501/docs.htm.[9]S choeberl M, Puffitsch W, Pedersen R U, et al. Worst -case Execution Time Analysis for a Java Processor[J]. Software: Practice and Experience, 2010, 40(6): 507-542.[10]Patterson D A, Hennessy J L. 计算机组成与设计硬件/软件接口[M]. 3 版. 郑纬民, 译. 北京: 机械工业出版社, 2007.。
Research on personalized recommendation system in E-learning
Research on Personalized Recommendation System in E-Iearning
Rong Shan
Zhibin Ren
authoring learning materials and preparing tests or examinations than their classical counter parts. The necessity of mastering technology-intensive teaching tools and the lack of the tutor's computer literacy often make tutors reluctant to participate in online teaching activity. Based on the analysis of the aforementioned limitations existing current e Education systems, it is obvious that on one hand, we need to provide learner with more intelligent learning environment that supports various customized learning services as needed, on the other hand, we need innovative mechanism to all aviate tutor work load in terms of facilitating the development of learning contents and test/exam by hiding as much technique details as possible. Now days more and more educators believe that the above identified factors are a key to the future successful e-Education and thus naturally become the focus and motivation of this dissertation. Now there are some personalized recommendation system [2], but users are asked to fill out registration information or regularly gives certain products (such as PTV) [3] information to build its user interest model by some of it, or the server access logs been analyzed to understand user behavior and interests to achieve the final recommendation. The former relies heavily on the use of the system, and to some extent will affect the user's normal activities; the latter when the user interest changes, recommended effect declined. For the current recommendation system problems, the paper design a personalized recommendation system which based on the browsing behavior (Browsing Behavior Personalized Information Recommendation System,BBIRS). II.
Recommendation method and system based on rating s
专利名称:Recommendation method and system based on rating space partitioned data发明人:DANIEL FRANKOWSKI,PAULBIEGANSKI,ROBERT DRISKILL,VALERIEGURALNIK,FILIP MULIER申请号:AU3284601申请日:20010119公开号:AU3284601A公开日:20010731专利内容由知识产权出版社提供摘要:Methods and systems consistent with the present invention provide a recommendation system that uses positive data and/or negative data separately to locate neighbors and provide recommendation to users. Such methods and systems use the data to locate potential neighbors based on users' ratings. Methods and systems consistent with the present invention calculate affinity values between the user and potential neighbors located to determine whether the potential neighbor's ratings are closely related to that of the user's ratings. If a user and a potential neighbor have an affinity greater than a predetermined threshold, that neighbor is considered close enough to the user to provide a recommendation for various items. Affinity values are calculated from a series of affinity equations available to the recommendation system.申请人:NET PERCEPTIONS, INC.更多信息请下载全文后查看。
新闻管理系统参考文献
新闻管理系统参考文献在设计和开发新闻管理系统方面,以下是一些常用的参考文献:1. "Building News Reader Using Web Technologies" by Miriam Redi and Franciska de Jong - 该文献提供了关于使用Web技术构建新闻阅读器的详细指导和方法,包括数据抓取、分析和展示。
2. "A Comparative Study of News Recommendation Approaches" by Rakesh Agrawal and et al. - 这篇研究文章比较了不同的新闻推荐方法,包括协同过滤、基于内容的过滤和混合过滤方法,对于新闻管理系统的推荐功能提供了启发。
3. "Design and Implementation of a Content Management System for News Agencies" by Marzieh Fadaee and Mohsen Kahani - 这篇文献介绍了一个新闻机构的内容管理系统的设计和实施,包括新闻编辑、发布和存档功能的详细说明。
4. "Design and Implementation of a News Recommender System based on Collaborative Filtering" by Zhiwen Yu, Xi Zhang, and et al. - 这篇文献描述了一个基于协同过滤的新闻推荐系统的设计和实施,为新闻管理系统的推荐功能提供了指导。
5. "A Comparative Analysis of Database Models for News Recommendation Systems" by Manfred Hauswirth and et al. - 这篇文献比较了不同的数据库建模方法在新闻推荐系统中的应用,对于新闻管理系统的数据库设计提供了参考。
毕业论文题目
毕业论文题目较好的毕业论文题目应该具备以下特点:具体明确,具有研究性、创新性,能够吸引读者兴趣,具备深入研究的可能性。
以下是一个750字的毕业论文题目示例:《基于人工智能技术的电影推荐系统设计与优化研究》摘要:本论文旨在通过分析国内外电影推荐系统的现状与问题,研究并设计一种基于人工智能技术的电影推荐系统,并通过算法优化,提高系统推荐的准确性和个性化水平。
首先,通过文献综述总结了目前主流电影推荐系统的发展现状以及存在的问题;然后,对常用的推荐算法进行了比较和分析,包括基于内容的推荐算法、协同过滤推荐算法、混合推荐算法等;接着,结合电影相关特征和用户行为数据,设计了一种基于深度学习的推荐算法模型,包括特征提取、特征选择、模型训练等步骤;最后,通过实验验证了所设计算法模型的有效性和优越性,并提出了系统的进一步优化方向。
关键词:电影推荐系统、人工智能、算法优化、个性化推荐、深度学习Abstract: This thesis aims to design and optimize a movie recommendation system based on artificial intelligence technology through analyzing the current status and problems of domestic and foreign movie recommendation systems. By improving algorithm, the accuracy and personalization of the system's recommendations are expected to be enhanced. Firstly, the development status and problems of mainstream movie recommendation systems weresummarized through literature review. Then, common recommendation algorithms including content-based recommendation, collaborative filtering recommendation, and hybrid recommendation algorithm were compared and analyzed. After that, combining movie-related features and user behavior data, a deep learning-based recommendation algorithm model was designed, consisting of feature extraction, feature selection, and model training steps. Finally, the effectiveness and superiority of the designed algorithm model were verified through experiments, and further directions for system optimization were proposed. Keywords: movie recommendation system, artificial intelligence, algorithm optimization, personalized recommendation, deep learning。
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
WEB RECOMMENDATION SYSTEM BASED ON AMARKOV-CHAIN MODELFrancois Fouss,Stephane Faulkner,Manuel Kolp,Alain Pirotte,Marco SaerensInformation Systems Research UnitIAG,Universite catholique de Louvain,Place des Doyens1,B-1348Louvain-la-Neuve,Belgiumfouss,kolp,pirotte,saerens@isys.ucl.ac.beInformation Systems Research UnitDepartment of Management Science,Universite of Namur,Rempart de la Vierge8,B-5000Namur,Belgiumstephane.faulkner@fundp.ac.beKeywords:Collaborative Filtering,Markov Chains,Multi Agent System.Abstract:This work presents some general procedures for computing dissimilarities between nodes of a weighted,undi-rected,graph.It is based on a Markov-chain model of random walk through the graph.This method is appliedon the architecture of a Multi Agent System(MAS),in which each agent can be considered as a node andeach interaction between two agents as a link.The model assigns transition probabilities to the links betweenagents,so that a random walker can jump from agent to agent.A quantity,called the averagefirst-passagetime,computes the average number of steps needed by a random walker for reaching agent for thefirst time,when starting from agent.A closely related quantity,called the average commute time,provides a distancemeasure between any pair of agents.Yet another quantity of interest,closely related to the average commutetime,is the pseudoinverse of the Laplacian matrix of the graph,which represents a similarity measure be-tween the nodes of the graph.These quantities,representing dissimilarities(similarities)between any twoagents,have the nice property of decreasing(increasing)when the number of paths connecting two agentsincreases and when the“length”of any path decreases.The model is applied on a collaborativefiltering taskwhere suggestions are made about which movies people should watch based upon what they watched in thepast.For the experiments,we build a MAS architecture and we instantiated the agents belief-set from a realmovie database.Experimental results show that the Laplacian-pseudoinverse based similarity outperforms allthe other methods.1IntroductionGathering product information from large elec-tronic catalogue on E-Commerce sites can be a time-consuming and information-overloading process.As information becomes more and more available on the World Wide Web,it becomes increasingly difficult for users tofind the desired product from the millions of products available.Recommender systems have emerged in response to these issues(Breese et al., 1998),(Resnick et al.,1994),or(Shardanand and Maes,1995).They use the opinions of members of a community to help individuals in that community to identify the information or products most likely to be interesting to them or relevant to their needs.As so, recommender systems can help E-commerce in con-verting web surfers into buyers by personalization of the web interface.They can also improve cross-sales by suggesting other products in which the consumer might be interested.In a world where an E-commerce site competitors are only two clicks away,gaining consumer loyalty is an essential business strategy.In this way,recommender systems can improve loyalty by creating a value-added relationship between sup-plier and consumer.One of the most successful technologies for recom-mender systems,called collaborativefiltering(CF), has been developed and improved over the past decade.For example,the GroupLens Research sys-tem(Konstan et al.,1997)provides a pseudony-mous CF application for Usenet news and movies. Ringo(Shardanand and Maes,1995)and MovieLens (Sarwar et al.,2001)are web systems that generate recommendations on music and movies respectively, suggesting collaborativefiltering to be applicable to many different types of media.Moreover,some of the highest commercial web sites like ,, and madeuse of CF technology.Although CF systems have been developed with success in a variety of domains,important research issues remain to be addressed in order to overcome two fundamental challenges:performances(e.g.,the CF system can deal with a great number of consumers in a reasonable amount of time)and accuracy(e.g., users need recommendations they can trust to help themfind products they will indeed like).This paper addresses both challenges by propos-ing a novel method for CF.The method includes a procedure based on a Markov-chain model used for computing dissimilarities between nodes of an undi-rected graph.This procedure is applied on the ar-chitecture of a Multi Agent System(MAS),in which each agent can be considered as a node and each in-teraction among agents as a link.Moreover,MAS ar-chitectures are gaining popularity over classic ones to build robust andflexible CF applications(Wooldridge and Jennings,1994)by distributing responsabilities among autonomous and cooperating agents.For illustration purposes,we consider in this work a simple MAS architecture which supports an E-commerce site selling DVD movies.The MAS ar-chitecture is instantiated with three sets of agents: user agent,movie agent and movie-category agent, and two kinds of interactions:between user agent and movie agent(has watched),and between movie agent and movie-category agent(belongs to).Then, the procedure allows to compute dissimilarities be-tween any pair of agents:Computing similarities between user agents allows to cluster them into groups with similar interest about bought movies.Computing similarities between user agent and movie agents allows to suggest movies to buy or not to buy.Computing similarities between user agent and movie-category agents allows to attach a most rel-evant category to each user agent.To compute the dissimilarities,we define a random-walk model through the architecture of the MAS by assigning a transition probability to each link (i.e.,interaction instance).Thus,a random walker can jump from neighbouring agents and each agent there-fore represents a state of the Markov model.From the Markov-chain model,we then compute a quantity,,called the averagefirst-passage time(Kemeny and Snell,1976),which is the average number of steps needed by a random walker for reach-ing state for thefirst time,when starting from state. The symmetrized quantity,, called the average commute time(Gobel and Jagers, 1974),provides a distance measure between any pair of agents.The fact that this quantity is indeed a dis-tance on a graph has been proved independently by Klein&Randic(Klein and Randic,1993)and Gobel &Jagers(Gobel and Jagers,1974).These dissimilarity quantities have the nice prop-erty of decreasing when the number of paths connect-ing the two agents increases and when the“length”of any path decreases.In short,two agents are consid-ered similar if there are many short paths connecting them.To our knowledge,while being interesting alterna-tives to the well-known“shortest path”or“geodesic”distance on a graph(Buckley and Harary,1990),these quantities have not been exploited in the context of collaborativefiltering;with the notable exception of (White and Smyth,2003)who,independently of our work,investigated the use of the averagefirst-passage time as a similarity measure between nodes.The “shortest path”distance does not have the nice prop-erty of decreasing when connections between nodes are added,therefore facilitating the communication between the nodes(it does not capture the fact that strongly connected nodes are at a smaller distance than weakly connected nodes).This fact has already been recognized in thefield of mathematical chem-istry where there were attempts to use the“commute time”distance instead of the“shortest path”distance (Klein and Randic,1993).Notice that there are many different ways of computing these quantities,by us-ing pseudoinverses or iterative procedures;details are provided in a related paper.Section2introduces the random-walk model-a Markov chain model.Section3develops our dissim-ilarity measures as well as the iterative formulae to compute them.Section4specifies our experimental methodology.Section5illustrates the concepts with experimental results obtained on a MAS instantiated from the MovieLens database.Section6is the con-clusion.2A Markov-chain model of MAS architecture2.1Definition of the weighted graphA weighted graph is associated with a MAS archi-tecture in the following obvious way:agents corre-spond to nodes of the graph and each interaction be-tween two agents is expressed as an edge connecting the corresponding nodes.In our movie example,this means that each instan-tiated agent(user agent,movie agent,and movie cat-egory agent)corresponds to a node of the graph,and each has_watched and belongs_to interaction is expressed as an edge connecting the corresponding nodes.The weight of the edge connecting node and node(say there are nodes in total)should be set to some meaningful value,with the following convention:the more important the relation between node and node,the larger the value of,and consequently the easier the communication through the edge.Notice that we require that the weights be both positive()and symmetric(). The elements of the adjacency matrix of the graph are defined in a standard way asif node is connected to nodeotherwise(1) where is symmetric.We also introduce the Lapla-cian matrix of the graph,defined in the usual man-ner:(2) where with(element of is).We also suppose that the graph is connected;that is,any node can be reached from any other node of the graph.In this case,has rank,where is the number of nodes(Chung,1997).If is a column vector made of(i.e.,T,where T de-notes the matrix transpose)and is a column vector made of,and T T hold:is doubly centered.The null space of is therefore the one-dimensional space spanned by.Moreover,one can easily show that is symmetric and positive semidef-inite(Chung,1997).Because of the way the graph is defined,user agents who watch the same kind of movie,and there-fore have similar taste,will have a comparatively large number of short paths connecting them.On the contrary,for user agents with different interests,we can expect that there will be fewer paths connecting them and that these paths will be longer.2.2A random walk model on thegraphThe Markov chain describing the sequence of nodes visited by a random walker is called a random walk on a weighted graph.We associate a state of the Markov chain to every node(say in total);we also define a random variable,,representing the state of the Markov model at time step.If the random walker is in state at time,we say.We define a random walk by the following single-step transition probabilities,whereIn other words,to any state or node,we asso-ciate a probability of jumping to an adjacent node,,which is proportional to the weight of the edge connecting and.The transition probabilities only depend on the current state and not on the past ones(first-order Markov chain).Since the graph is totally connected,the Markov chain is irre-ducible,that is,every state can be reached from any other state.If this is not the case,the Markov chain can be decomposed into closed sets of states which are completely independent(there is no communica-tion between them),each closed set being irreducible. Now,if we denote the probability of being in state at time by and we define as the transition matrix whose entries are,the evolution of the Markov chain is characterized byOr,in matrix form,T(3) where T is the matrix transpose.This provides the state probability distributionT at time once theinitial probability density,,is known.For more de-tails on Markov chains,the reader is invited to consult standard textbooks on the subject(Bremaud,1999),(Kemeny and Snell,1976),(Norris,1997).3Averagefirst-passage time and average commute timeIn this section,we review two basic quantities that can be computed from the definition of the Markovchain,that is,from its probability transition matrix:the averagefirst-passage time and the average com-mute time.Relationships allowing to compute thesequantities are derived in a heuristic way(see,e.g., (Kemeny and Snell,1976)for a more formal treat-ment).3.1The averagefirst-passage time The averagefirst-passage time,is defined as the average number of steps a random walker,start-ing in state,will take to enter state for thefirst time(Norris,1997).More precisely,we define theminimum time until absorption by state asand for one realiza-tion of the stochastic process.The averagefirst-passage time is the expectation of this quantity,when starting from state:. We show in a related paper how to derive a recur-rence relation for computing byfirst-step anal-ysis.We obtain,for(4) These equations can be used in order to iteratively compute thefirst-passage times(Norris,1997).The meaning of these formulae is quite obvious:in order to go from state to state,one has to go to any ad-jacent state and proceed from there.3.2The average commute timeWe now introduce a closely related quantity,the aver-age commute time,,which is defined as the average number of steps a random walker,starting in state,will take before entering a given state for thefirst time,and go back to.That is,.Notice that,whileis symmetric by definition,is not.3.3The average commute time is adistanceAs shown by several authors(Gobel and Jagers, 1974),(Klein and Randic,1993),the average com-mute time is a distance measure,since,for any states ,,:if and only ifAnother important point not proved here is that is a matrix whose elements are the inner prod-ucts of the node vectors embedded in an Euclidean space preserving the ECTD between the nodes in this Euclidean space,the node vectors are exactly sepa-rated by ECTD.can therefore be considered as a similarity matrix between the nodes(as in the vectors space model in information retrieval).In summary,three basic quantities will be used as providing a dissimilarity/similarity measure between nodes:the averagefirst-passage time,the average commute time,and the pseudoinverse of the Lapla-cian matrix.4Experimental methodology Remember that each agent of the three sets cor-responds to a node of the graph.Each node of the user-agent set is connected by an edge to the watched movies of the movie-agent set.In all these experi-ments we do not take the movie-category agent set into account in order to perform fair comparisons be-tween the different methods.Indeed,two scoring algorithms(i.e.,cosine and nearest-neighbours algo-rithms)cannot naturally use the movie-category set to rank the movies.4.1Data setFor these experiments,we developed a MAS archi-tecture corresponding to our movie example.The belief set of the user agents,movie agents,and movie-category agents has been instantiated from the real MovieLens database(). Each week hundreds of users visit MovieLens to rate and receive recommendations for movies.We used a sample of this database proposed in (Sarwar et al.,2002).Enough users were randomly selected to obtain100,000ratings(considering only users that had rated20or more movies).The database was then divided into a training set and a test set(which contains10ratings for each of 943users).The training set set was converted into a2625x2625matrix(943user agents,and1682 movie agents that were rated by at least one of the user agents).The results shown here do not take into account of the ratings provided by the user agents here(the experiments using the ratings gave similar results)but only the fact that a user agent has or has not interacted with a movie agent(i.e.,the user-movie matrix isfilled in with’s or’s).We then applied the methods described in Section 4.2to the training set and compared the results thanks to the test set.4.2Scoring algorithmsEach method supplies,for each user agent,a set of similarities(called scores)indicating preferences about the movies,as computed by the method.Tech-nically,these scores are derived from the computa-tion of dissimilarities between the user-agent nodes and the movie-agent nodes.The movie agents that are closest to an user agent,in terms of this dissimilarity score,(and that have not been watched)are consid-ered the most relevant.Thefirst four scoring algorithms are based on the averagefirst-passage time and are computed from the probability transition matrix of the corresponding Markov model.Average commute time(CT).We use the average commute time,,to rank the agents of the con-sidered set,where is an agent of the user-agent set and is an agent of the set to which we compute the dissimilarity(the movie-agent set).For instance,if we want to suggest movies to people for watching, we will compute the average commute time between user agents and movie agents.The lower the value is,the more similar the two agents are.In the sequel, this quantity will simply be referred to as“commute time”.Principal components analysis defined on aver-age commute times(PCA CT).In a related paper, we showed that,based on the eigenvector decomposi-tion,the nodes vectors,,can be mapped into a new Euclidean space(with2625dimensions in this case) that preserves the Euclidean Commute Time Distance (ECTD),or a dimensional subspace keeping as much variance as possible,in terms of ECTD.We varied the dimension of the subspace,,from to by step of.It shows the percentage of vari-ance accounted for by thefirst principal compo-nents.After performing a PCA and keeping a given number of principal components, we recompute the distances in this reduced subspace. These Euclidean commute time distances between user agents and movie agents are then used in order to rank the movies for each user agent(the closest first).The best results were obtained for dimen-sions().Notice that,in the related paper,we also shows that this decomposition is similar to principal components analysis in the sense that the projection has maximal variance among all the possible candidate projections. Averagefirst-passage time(one-way).In a simi-lar way,we use the averagefirst-passage time,, to rank agent of a the movie-agent set with respect to agent of the user-agent set.This provides a dis-similarity between agent and any agent of the con-sidered set.This quantity will simply be referred to as “one-way time”.Averagefirst-passage time(return).As a dis-similarity between agent of the user-agent set and agent of the movie-agent set,we now use(the transpose of),that is,the average time used to reach(from the user-agent set)when starting from .This quantity will simply be referred to as“return time”.We now introduce other standard collaborativefil-tering methods to which we will compare our algo-rithms based onfirst-passage time.Nearest neighbours(KNN).The nearest neigh-bours method is one of the simplest and oldest meth-ods for performing general classification tasks.It can be represented by the following rule:to classify an unknown pattern,choose the class of the nearest ex-ample in the training set as measured by a similar-ity metric.When choosing the-nearest examples to classify the unknown pattern,one speaks about-nearest neighbours techniques.Using a nearest neighbours technique requires a measure of“closeness”,or“similarity”.There is often a great deal of subjectivity involved in the choice of a similarity measure(Johnson and Wichern, 2002).Important considerations include the nature of the variables(discrete,continuous,binary),scales of measurement(nominal,ordinal,interval,ratio),and subject matter knowledge.IndividualTotals IndividualTotalsTable1:Contingency table.In the case of our MAS movie architecture,pairs of agents are compared on the basis of the presence or absence of certain features.Similar agents have more features in common than do dissimilar agents.The presence or absence of a feature is described mathe-matically by using a binary variable,which assumes the value if the feature is present(if the person has watched the movie,that is if the user agent has an interaction with movie agent)and the value if the feature is absent(if the person has not watched the movie that is if the user agent has no interaction with movie agent).More precisely,each agent is characterized by a binary vector,,encoding the interactions with the movie agents(remember that there is an interac-tion between an user agent and a movie agent if the considered user has watched the considered movie). The nearest neighbours of agent are computed by taking the nearest according to a given simi-larity measure between binary vectors,.We performed systematic comparisons between eight different such measures(see(Johnson and Wichern,2002),p.674).Based on these compar-isons,we retained the measure that provide the best results:,where,,and are defined in Table1.In this table,represents the frequency of -matches between and,is the frequency of -matches,and so forth.We also varied systematically the number of neigh-bours.The best score was ob-tained with neighbours.In Section5,we only present the results ob-tained by the best-nearest neighbours model(i.e.,and).Once the-nearest neighbours are computed,the movie agents that are proposed to user agent are those that have the highest predicted values.The pre-dicted value of user agent for movie agent is com-puted as a sum weighted by of the values(or)of item for the neighbours of user agent:(5) where is defined in Equation1and we keep only the nearest neighbours.Cosine coefficient.The cosine coefficient be-tween user agents and,which measures the strength and the direction of a linear relationship between two variables,is defined byT.The predicted value of user agent for movie agent ,considering neighbours(i.e.,),is com-puted in a similar way as in the-nearest neighbours method(see Equation5).Dunham overviews in(Dunham,2003)other simi-larity measures related to cosine coefficient(i.e.,Dice similarity,Jaccard similarity and Overlap similarity). In Section5,we only show the results for the cosine coefficient,the other methods giving very close re-sults.Katz.This similarity index has been proposed in the social sciencesfield.In his attempt tofind a new social status index for evaluating status in a manner free from the deficiencies of popularity contest pro-cedures,Katz proposed in(Katz,1953)a method of computing similarities,taking into account not only the number of direct links between items but,also, the number of indirect links(going through interme-diaries)between items.The similarity matrix iswhere is the adjacency matrix and is a constant which has the force of a probability of effectiveness of a single link.A-step chain or path,then,has probability of being effective.In this sense,ac-tually measures the non-attenuation in a link, corresponding to complete attenuation and to absence of any attenuation.For the series to be con-vergent,must be less than the inverse of the spectral radius of.For the experiment,we varied systematically the value of and we only present the results obtained by the best model(i.e.,*(spectral radius)). Once we have computed the similarity matrix,the closest movie agent representing a movie that has not been watched is proposedfirst to the user agent. Dijkstra’s algorithm.Dijkstra’s algorithm solves a shortest path problem for a directed and connectedgraph which has nonnegative edge weights.As a dis-tance between two agents of theMAS architecture,we compute the shortest path between these two agents. The closest movie agent representing a movie that has not been watched is proposedfirst to the user agent. Pseudoinverse of the Laplacian matrix(). The pseudoinverse of the Laplacian matrix provides a similarity measure since is the matrix contain-ing the inner product of the vectors in the transformed space where the nodes are exactly separated by the ECTD(details are provided in a related paper).The predicted value of user agent for movie agent, considering neighbours(i.e.,),is com-puted in a similar way as in the-nearest neighbours method(see Equation5).4.3Performance evaluationThe performances of the scoring algorithms will be assessed by a variant of Somers’D,the degree of agreement(Siegel and Castellan,1988).For computing this degree of agreement,we con-sider each possible pair of movie agents and deter-mine if our method ranks the two agents of each pair in the correct order(in comparison with the test set which contains watched movies that should be rankedfirst)or not.The degree of agree-ment is therefore the proportion of pairs ranked in the correct order with respect to the total number of pairs, without considering those for which there is no pref-erence.A degree of agreement of(of all the pairs are in correct order and are in bad order)is similar to a completely random ranking.On the other hand,a degree of agreement of1means that the pro-posed ranking is identical to the ideal ranking.5Results5.1Ranking procedureFor each user agent,wefirst select the movie agents representing movies that have not been watched. Then,we rank them according to one of the proposed scoring algorithms.Finally,we compare the proposed ranking with the test set(if the ranking proce-dure performs well,we expect watched movies be-longing to the test set to be on top of the list)by using the degree of agreement.CT PCA CT One-way Return Katz KNN Dijkstra CosineTable2:Results obtained by the ranking procedures without considering the movie-category set.5.2Results and discussionsThe results of the comparison are tabulated in Table 2(where we display the degree of agreement for each method).We used the test set(which includes 10movies for each of the943users)to compute the global degree of agreement.Based on Table2,we observe that the best degree of agreement is obtained by the method(). The next degrees of agreement are obtained by the Cosine()and the-nearest neighbours method ().It is also observed that the commute time and the averagefirst-passage time(one-way)provide good results too,but are outperformed by the Co-sine,the KNN,Katz’algorithm(),and the PCA().They present a degree of agreement of and respectively.Notice,however,that the results of the method,the Cosine,the KNN, and the PCA are purely indicative,since they highly depend on the appropriate number of neighbours or on the appropriate number of principal components, which are difficult to estimate a priori.The com-mute time and the averagefirst-passage time(one-way)outperform the averagefirst-passage time(re-turn)().A method provides much worse re-sults:Dijkstra’s algorithm().It seems that,for Dijkstra algorithm,nearly each movie agent can be reached from any user agent with a shortest path dis-tance of.The degree of agreement is therefore close to because of the difficulty to rank the movies agent.5.3Computational issuesIn this section,we perform a comparison of the com-puting times(for a Pentium4,2.40GHz)for all the implemented methods:the average commute time, the principal components analysis(we consider10 components),the averagefirst-passage time one-way and return,the Katz method,the-nearest neigh-bours(we consider neighbours),the Dijkstra al-gorithm,the Cosine method(we consider againCT PCA CT One-way Return Katz KNN Dijkstra CosineTable3:Time(in sec)needed to compute predictions for all the non-watched movies and all the users neighbours),and the method.Table3shows the times,in seconds(using the Matlab cputime func-tion),needed by each method to provide predictions for all the non-watched movies agent and for each user agent(i.e.,user agents).We observe on the one hand,that the fastest method is the-nearest neighbours method and one the other hand,that the slowest methods are PCA and Dijkstra algorithm.The method which provides the best de-gree of agreement(i.e.,using the matrix as sim-ilarity measure)takes much more time than the-nearest neighbours method but is quite as fast as the Markov-based algorithms.6Conclusions and further workWe introduced a general procedure for computing dissimilarities between agents of a MAS architecture. It is based on a particular Markov-chain model of ran-dom walk through the graph.More precisely,we compute quantities(averagefirst-passage time,av-erage commute time,and the pseudoinverse of the Laplacian matrix)that provide dissimilarity measures between any pair of agents in the system.We showed through experiments performed on MAS architecture instantiated from the MovieLens database that these quantities perform well in com-parison with standard methods.In fact,as already stressed by(Klein and Randic,1993),the introduced quantities provide a very general mechanism for com-puting similarities between nodes of a graph,by ex-ploiting its structure.We are now investigating ways to improve the Markov-chain based methods.The main drawback of these methods is that it does not scale well for large MAS.Indeed,the Markov model has as many states as agents in the MAS.Thus, in the case of large MAS,we should rely on the sparseness of the data matrix as well as on iterative formulae(such as Equation4).REFERENCESBreese,J.,Heckerman,D.,and Kadie,C.(1998).Empirical analysis of pre-dictive algorithms for collaborativefiltering.Proceedings of the14th Conference on Uncertainty in Artificial Intelligence.Bremaud,P.(1999).Markov Chains:Gibbs Fields,Monte Carlo Simulation, and Queues.Springer-Verlag.Buckley,F.and Harary,F.(1990).Distance in graphs.Addison-Wesley Publishing Company.Chung,F.R.(1997).Spectral Graph Theory.American Mathematical Soci-ety.Dunham,M.(2003).Data Mining:Introductory and Advanced Topics.Pren-tice Hall.Gobel,F.and Jagers,A.(1974).Random walks on graphs.Stochastic Pro-cesses and their Applications,2:311–336.Johnson,R.and Wichern,D.(2002).Applied Multivariate Statistical Analy-sis,5th Ed.Prentice Hall.Katz,L.(1953).A new status index derived from sociometric analysis.Psy-chmetrika,18(1):39–43.Kemeny,J.G.and Snell,J.L.(1976).Finite Markov Chains.Springer-Verlag.Klein,D.J.and Randic,M.(1993).Resistance distance.Journal of Mathe-matical Chemistry,12:81–95.Konstan,J.,Miller,B.,Maltz,D.,Herlocker,J.,Gordon,L.,and Riedl,J.(1997).Grouplens:Applying collaborativefiltering to usenet news.Communications of the ACM,40(3):77–87.Norris,J.(1997).Markov Chains.Cambridge University Press.Resnick,P.,Neophytos,I.,Mitesh,S.,Bergstrom,P.,and Riedl,J.(1994).Grouplens:An open architecture for collaborativefiltering of net-news.Proceedings of the Conference on Computer Supported Coop-erative Work,pages175–186.Sarwar,B.,Karypis,G.,Konstan,J.,and Riedl,J.(2001).Item-based col-laborativefiltering recommendation algorithms.Proceedings of the International World Wide Web Conference,pages285–295.Sarwar,B.,Karypis,G.,Konstan,J.,and Riedl,J.(2002).Recommender sys-tems for large-scale e-commerce:Scalable neighborhood formation using clustering.Proceedings of the Fifth International Conference on Computer and Information Technology.Shardanand,U.and Maes,P.(1995).Social informationfiltering:Algo-rithms for automating’word of mouth’.Proceedings of the Confer-ence on Human Factors in Computing Systems,pages210–217.Siegel,S.and Castellan,J.(1988).Nonparametric Statistics for the Behav-ioral Sciences,2nd Ed.McGraw-Hill.White,S.and Smyth,P.(2003).Algorithms for estimating relative impor-tance in networks.Proceedings of the ninth ACM SIGKDD Interna-tional Conference on Knowledge Discovery and Data mining,pages 266–275.Wooldridge,M.and Jennings,N.R.(1994).Intelligent agents:Theory and practice.Knowledge Engineering Review paper,2:115–152.。