Introduction to China Mobile's Hadoop Data Mining Platform
The database is overloaded.
ELT puts heavy pressure on the database; in a branch company, one server cannot support all operations.
Data backup is not well supported.
Internal material. Keep confidential.
What is BASS
BASS (Business Analysis Support System) is a BI system that CMCC uses to support enterprise decision-making, marketing management analysis, and sales. BASS comprises a data extraction layer, a data processing layer, a data display layer, and an application layer. The main operations in the data processing layer are:
ETL (14 different ETL operations from 6 categories based on MapReduce)
Statistics, attribute processing, data sampling, query, data processing, redundant data processing
Features of BC-PDM (I)
» Targeting general data analysis and data mining platform/tools
BC-PDM(phase I)
Workflow management
GUI - drag-and-drop operation for application modeling design
Job monitoring
Flow configuration
What is BC-PDM
BC-PDM: Big Cloud based Parallel Data Mining Platform
A data mining solution for large-scale data analysis
Massive scalability - based on Hadoop
Low cost - commodity machines and free software
Customization - tailored to application requirements
Easy to use - user interface similar to commercial tools
BC-PDM Architecture
Data mining App
• Large-scale data processing
• Large-scale data mining
• Excellent scalability
DE
DT
• Large-scale storage
• High performance
• High availability
• Low price
Outline
Introduce BC-PDM architecture
Architecture Features compared between phase I and phase II
Conclusions and Future works
Conclusions Future works
Voice: 100 million * 1 KB = 100 GB/day
SMS: 100~200 million * 1 KB = 100~200 GB/day
……
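The voice and SMS estimates above are plain multiplications of records per day by record size. A minimal sketch of that arithmetic follows, assuming decimal units (1 KB = 1000 bytes, 1 GB = 10^9 bytes), which is what makes the slide's round numbers come out exactly; the function name `daily_volume_gb` is hypothetical.

```python
# Assumption: decimal units, so 100 million records * 1 KB is exactly 100 GB.
KB = 10**3
GB = 10**9


def daily_volume_gb(records_per_day, bytes_per_record=KB):
    """Return the raw daily data volume in GB for a given record count."""
    return records_per_day * bytes_per_record / GB


# Record counts taken from the slide; 1 KB per record is the slide's estimate.
voice_gb = daily_volume_gb(100_000_000)
sms_low_gb = daily_volume_gb(100_000_000)
sms_high_gb = daily_volume_gb(200_000_000)

print(f"Voice: {voice_gb:.0f} GB/day")
print(f"SMS:   {sms_low_gb:.0f} to {sms_high_gb:.0f} GB/day")
```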
Network signaling data for a branch company (> 20 million subscribers)
GPRS signaling data: 48 GB/day per branch company
3G signaling data: 300 GB/day per branch company
Voice and SMS signaling data, ……
› Data mining Algorithm (4 more)
• Classifier, Sequence Association Analysis
Data mining Algorithm (9 algorithms from 3 categories based on MapReduce)
Clustering, Classifier, Association Analysis
Output Data
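Key technical solution - Parallel ETL - Redundancy removal
Function
The redundancy removal operation deletes records that are exactly identical to another record anywhere in the data set, keeping only one record from each group of identical records.
1) Parallelizes redundancy removal on data tables
2) Correctness: results are identical to those of the serial version
3) Speedup is close to linear; TB-scale data is processed in on the order of a thousand seconds
Serial redundancy removal in the database
1) The map phase splits the data to be processed into blocks, one block per processing node. The map input key is the default (the offset of each line of data) and the value is the text of that line, so the lines of each block are read in one after another. Each map task emits intermediate <key, value> pairs in which the key is the full text of the line and the value is empty text.
2) For data with the same key, reduce emits a single pair: the key is the whole line and the value is empty, so only one copy of each identical record is kept. The reduce output is stored in the distributed file system.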
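The two steps above can be sketched in plain Python. This is a single-process simulation of the map, shuffle, and reduce stages, not Hadoop code and not the BC-PDM implementation; the function names and the sample records are hypothetical.

```python
from itertools import groupby


def map_phase(block):
    """Map: input is (offset, line); emit (line, "") so the whole line is the key."""
    for _offset, line in block:
        yield line, ""


def shuffle(pairs):
    """Stand-in for the framework's shuffle: sort and group pairs by key."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, [value for _key, value in group]


def reduce_phase(key, values):
    """Reduce: emit each key once with an empty value, however many copies arrived."""
    yield key, ""


def remove_redundancy(blocks):
    """Run map over every block, then shuffle and reduce; return the unique lines."""
    intermediate = [pair for block in blocks for pair in map_phase(block)]
    return [
        out_key
        for key, values in shuffle(intermediate)
        for out_key, _empty in reduce_phase(key, values)
    ]


# Made-up sample data: two blocks, with one record duplicated across blocks.
blocks = [
    [(0, "a,1"), (4, "b,2")],
    [(0, "a,1"), (4, "c,3")],
]
unique_lines = remove_redundancy(blocks)
print(unique_lines)
```

Because the whole line is the key, the deduplication is done entirely by the grouping step; the reducer never has to inspect its values.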
Parallel Data Mining Platform in Telecom Industry
-- Big Cloud based Parallel Data Mining Platform
Friday, Oct 2, 2009, NYC
Research Institute of China Mobile Communication Corporation
Feng Cao
Set the target fields as the key and the other fields as the value
ReduceTasker 1
For each key, read the value list and write a single record
ReduceTasker m
For each key, read the value list and write a single record
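When only some fields define a duplicate, the mappers key on those target fields and the reducers write one record per key. A minimal single-process sketch follows. It assumes the reducer keeps the first value it sees, which the slides do not specify, and in a real MapReduce job the order of the value list is not guaranteed. All names and sample records are hypothetical.

```python
from collections import defaultdict


def map_task(records, target_fields):
    """Emit (key, value): key holds the target fields, value holds the rest."""
    for record in records:
        key = tuple(record[i] for i in target_fields)
        value = tuple(v for i, v in enumerate(record) if i not in target_fields)
        yield key, value


def reduce_task(key, values):
    """Write once per key. Assumption: keep the first value in the list."""
    return key, values[0]


def dedupe_by_fields(splits, target_fields):
    """Simulate n map tasks over the input splits, then reduce per key."""
    grouped = defaultdict(list)
    for split in splits:
        for key, value in map_task(split, target_fields):
            grouped[key].append(value)
    return [reduce_task(key, values) for key, values in grouped.items()]


# Made-up records: (caller, region, time). Fields 0 and 1 are the target fields.
splits = [
    [("138", "bj", "09:00"), ("139", "sh", "09:05")],
    [("138", "bj", "09:30")],
]
deduped = dedupe_by_fields(splits, target_fields=(0, 1))
print(deduped)
```

Choosing all fields as the target fields reduces this to whole-record deduplication, with an empty value for every key.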
Large Scale Data Applications and current solution
The Requirements
Precision marketing
Analysis of user behavior, customer churn prediction, service association analysis, ……
Data extraction from other systems, data transfer, data aggregation, data statistics, …
BASS is built on a database system and most operations are performed inside the database, so it implements ELT (Extract, Load, Transform) rather than ETL.
Features of BC-PDM(II)
» Targeting general data analysis and data mining platform/tools
BC-PDM(phase I)
Visualization
Text, decision tree, pie chart, and histogram
Enterprise Miner, Clementine, Intelligent Miner
Service Optimization and Log Processing
Spam Message Filtering ……
Most run on Unix servers, with data stored in storage arrays
Large scale data in China Mobile Communication Corporation (CMCC)
Subscribers: 500 million
Subscribers' CDR (call detail record) data: 5~8 TB/day in CMCC
For a branch company (> 20 million subscribers)
Set the target fields as the key and the other fields as the value
Define the target fields (one or all)
MapTasker 2
Set the target fields as the key and the other fields as the value
MapTasker n
Challenges and limitations of BASS
Hardware investment is large, and expansion is costly.
62% of the investment goes to hardware. Because Unix servers differ in their specifications, expanding capacity means buying entirely new Unix servers rather than adding a few to the existing ones.
Case I - MapReduce-based ETL
Function - Redundancy Removal
Delete duplicate records in CDR data, keeping a single copy of each.
Input Data
MapTasker 1
» BC-PDM(phase II)
› Web based GUI
› Provide SaaS mode for users
› Data Transfer Tool
› Provide data upload and download tools for SaaS
› Security
› Multi-tenancy and user groups for branches, ACLs for data access
» BC-PDM(phase II)
› DE (Data Exploration)
› Simple data analysis and preview
› ETL (25 more)
• Simulates SQL operations: supports Join, Group by, Expression, Case When, Update, etc.
Current solution
Commercial database / data warehouse systems
Commercial Data Mining Tools
Network Optimization
Network QoS analysis, signaling data analysis, ......
Offline data backup (5 branches) takes a lot of time, online data backup (8 branches) consumes a lot of resources, and file backup (18 branches) is slow to restore.
Management of the IT system is complex.
One Unix server cannot support a BASS. Every branch subsystem has about 3-5 servers, such as an ETL server, a database server, an interface server, and a display server.