Distributed Systems Paper Report (English)
Yahoo S4 stream computing platform
114106000699 陈娜
S4 (Simple Scalable Streaming System) is a platform originally developed by Yahoo to improve the effective click-through rate of search advertisements. By analyzing how often users click on ads and removing the ads with low relevance, S4 raises the overall click-through rate. It can be regarded as a distributed stream computing model.
S4 is designed for streaming data and real-time processing, so it lets a business that needs real-time results analyze its data efficiently. Once the system is online, it rarely requires human intervention: a steady stream of data is analyzed and routed automatically, and S4 can process very large volumes of data quickly. The disadvantage is that data transmission in S4 is currently not reliable, so data may be lost. Because the data is held in memory, all of the data on a node is lost when that node breaks down. Even so, there are scenarios that suit S4 well. Real-time data analysis usually deals with small, discrete pieces of data, and from a statistical point of view, losing part of the data has no significant impact on the final results. In exchange, throughput improves significantly. For now, therefore, S4 is best suited to scenarios that do not need a careful analysis of every single data item, but only the final aggregate results, which are then used to adjust and forecast the business.
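As background for the node-failure problem discussed next, the following is a minimal Python sketch of key-based routing, assuming the destination node is chosen as hash(key) mod N. The function names `route`, `lost_events`, and `reroute`, the use of CRC32 as the hash, and the `ad-<i>` keys are illustrative assumptions and not part of S4's actual API. `reroute` shows one possible way to hash over the live nodes only.

```python
import zlib


def route(key: str, num_nodes: int) -> int:
    """Pick a node index from the event key: hash(key) mod N.

    CRC32 is an assumption chosen because it is stable across runs;
    the hash function S4 really uses is not specified in this report.
    """
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def lost_events(keys, num_nodes, failed_node):
    """Keys whose events still go to failed_node when N is left unchanged."""
    return [k for k in keys if route(k, num_nodes) == failed_node]


def reroute(key: str, live_nodes):
    """Hypothetical fix: hash over the list of live nodes only."""
    return live_nodes[zlib.crc32(key.encode("utf-8")) % len(live_nodes)]


keys = [f"ad-{i}" for i in range(1000)]   # made-up event keys
nodes = [0, 1, 2, 3]
failed = 2                                # pretend node 2 has exited

dropped = lost_events(keys, len(nodes), failed)
live = [n for n in nodes if n != failed]

print(len(dropped), "of", len(keys), "keys still hash to the failed node")
print("rerouted keys avoid the failed node:",
      all(reroute(k, live) != failed for k in keys))
```

Rehashing over the live nodes is only the simplest option. It also moves many keys that were on healthy nodes, which is why schemes such as consistent hashing are often preferred in practice.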
While the system is running, a node may fail or exit for other reasons. S4 nevertheless keeps sending events to the failed node, so a large number of events are lost. The cause is that S4 computes the identifier of the destination node from the event's key value and the number of nodes. When a node exits, there is no mechanism to update the node count accordingly, so keys continue to hash to the original node identifier and many new events are sent to the failed node.
To address this shortcoming, I propose a requirement for dynamic node removal. While the distributed stream computing framework is running, and without interrupting the business, if a node fails or exits for other reasons, the remaining nodes should detect the exit within a short time and take over the exited node's work as soon as possible. This avoids a large number of new events being sent to the exited node and lost, and it lets the framework regain load balance shortly after the node is removed.
A node may exit either because it fails or because the system administrator decides to replace an old node. In addition, to reduce the error rate, each node in the S4 system is extended to two nodes whose contents are completely identical. When a node breaks down, the system