Big Data Research (Zongben Xu)
Sub-sampling problem: A big data set has to be processed by some type of "divide-and-conquer" scheme, such as a Hadoop system.
Map (random sub-sampling): D is split into subsets D1, …, Dm
Reduce (aggregation)
Engineering
Data acquisition & data management
Data storage & processing
Data understanding
Applications
Fundamental Challenge 1
Fundamental Challenge 2
Fundamental Challenge 3
Scientific Research
• High-energy physics
• Astronomy
• Life science
• Geosciences and remote sensing
Social Governance
• The fourth paradigm of research
• A systematic approach uniquely applicable to modern management (Jim Gray)
• Big data view of assessing public policies
Volume
PB to ZB in scale; distributed storage and processing necessary.
Velocity
Growing tremendously; data flow.
Variety
Multisource, correlated, heterogeneous; unstructured, unreliable, inconsistent.
Value
The total dataset embodies great value; an individual item or small subset contains less information.
Big Data: Opportunities and Challenges
What opportunities: Big data embodies great value that might not be discoverable in small data sets.
Management Science Information Science Math and statistics
Data acquisition & data management: Acquisition; Quality; Standard; Sharing; Privacy protection; Data-driven management (Social media based; Trade data based; Record (Survey, Observation) based; Empirical data based; Experimental data based)
Data storage & processing: Safety; Architecture; System/Software/Algorithm; Scalability/Complexity; Real-time processing
Data understanding: Representation (Uniform scheme; Complexity); Modeling (Parent space identification; Sampling); Mining (Clustering; Classification; Regression; Prediction; Variable selection); Analytics (Relevance analysis; Latent variable analytics; Statistical inference); Computation (Subsampling; Complexity; Distributed computation)
Applications: Highly domain-specific; any data-driven fields
Fundamental Challenge 4
Big Data Industry (Value chain management, Business pattern,…)
Business
• New chances of gaining benefits/income
• Finding valuable customers
• Marketing
Big Data: Opportunities and Challenges
Big data research: a truly inter-/multidisciplinary activity.
Exploring Big Data Analysis: Fundamental Scientific Problems
Zongben Xu
(Xi’an Jiaotong University)
Email: zbxu@mail.xjtu.edu.cn
Homepage: http://zbxu.gr.xjtu.edu.cn
Making a high-D problem well defined: sparse modeling; high-D statistics; high-D data mining (clustering stability, classification consistency), with $x_t^{(n)} \in \mathbb{R}^{p(n)}$
Problem 2: Sub-sampling
(Wikipedia)
ZB ($10^{21}$), EB ($10^{18}$), PB ($10^{15}$), TB ($10^{12}$), GB ($10^{9}$), MB ($10^{6}$)
Big Data: Opportunities and Challenges
Why is it difficult? Big data challenges existing information technologies, management paradigms, and the statistical and computational sciences.
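$Y = X_{n \times p}\,\beta_{p \times 1}$
Solution: $\hat{\beta} = (X'X)^{-1} X'Y$
Asymptotic normality: $\sqrt{n}\,(\hat{\beta} - \beta) \sim N\!\left(0, \left(\tfrac{1}{n} X'X\right)^{-1} \sigma^2\right) \xrightarrow{d} N(0, \sigma^2 I_{p \times p})$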
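As a rough numerical check of the asymptotic normality statement, the following minimal sketch simulates the one-feature case (p = 1) in plain Python. The additive Gaussian noise, the standard normal design (chosen so that X'X/n tends to 1), and all function names and parameter values are assumptions made for illustration; none of them come from the slides.

```python
import math
import random
import statistics

def ols_slope(xs, ys):
    """OLS for the one-feature model y = beta * x + noise: (x'x)^{-1} x'y."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def scaled_errors(beta, sigma, n, replicates, rng):
    """Return sqrt(n) * (beta_hat - beta) over independent simulated data sets."""
    out = []
    for _ in range(replicates):
        # Assumed design: x ~ N(0, 1), so E[x^2] = 1 and x'x / n -> 1.
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [beta * x + rng.gauss(0.0, sigma) for x in xs]
        out.append(math.sqrt(n) * (ols_slope(xs, ys) - beta))
    return out

rng = random.Random(42)
errors = scaled_errors(beta=2.0, sigma=1.0, n=200, replicates=2000, rng=rng)
print("mean of scaled errors:", statistics.fmean(errors))
print("variance of scaled errors:", statistics.variance(errors))
```

If the limit holds, the mean should be close to 0 and the variance close to sigma^2 = 1 under these assumptions.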
Problem 1: High Dimensionality
Outline
• Big Data: Opportunities and Challenges
• Some More Scientific Problems in Big Data Analysis and Processing
• Some Advances on Big Data Research
Hot issues: sparse modeling (compressed sensing; low-rank matrix decomposition; sparse learning)
Core open question: how to add priors so that a high-D problem can be well defined?
Here p is far larger than the sample size n, and n varies with p (n = n(p)).
Classical: n >> p; High-D: p >> n; Big data: p >> n(p).
Linear model: $y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$
Data: $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
Matrix form: $Y = X_{n \times p}\,\beta_{p \times 1}$
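To make the role of a prior concrete: when p > n the least-squares problem has no unique solution, and a sparsity penalty is one common way to pick one. The sketch below uses the lasso (an l1 penalty) solved by iterative soft-thresholding. This is only one example of sparse modeling and is not claimed to be the method the slides have in mind; the helper names, the synthetic data, and the choice of penalty weight are all assumptions.

```python
import random

def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def predict(X, beta):
    """Compute X @ beta for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, beta)) for row in X]

def objective(X, y, beta, lam):
    """Lasso objective: 0.5 * ||y - X beta||^2 + lam * ||beta||_1."""
    resid = [yi - fi for yi, fi in zip(y, predict(X, beta))]
    return 0.5 * sum(r * r for r in resid) + lam * sum(abs(b) for b in beta)

def ista(X, y, lam, iters=500):
    """Iterative soft-thresholding for the lasso (hypothetical helper)."""
    n, p = len(X), len(X[0])
    # The squared Frobenius norm bounds the largest eigenvalue of X'X,
    # so its inverse is a conservative but safe step size.
    step = 1.0 / sum(a * a for row in X for a in row)
    beta = [0.0] * p
    for _ in range(iters):
        resid = [fi - yi for fi, yi in zip(predict(X, beta), y)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return beta

rng = random.Random(0)
n, p = 20, 60                      # fewer samples than features
true_beta = [0.0] * p
true_beta[3], true_beta[17], true_beta[42] = 2.0, -3.0, 1.5  # assumed sparse truth
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [fi + rng.gauss(0.0, 0.05) for fi in predict(X, true_beta)]

# Assumed penalty weight: a tenth of the smallest value that forces beta = 0.
lam = 0.1 * max(abs(sum(X[i][j] * y[i] for i in range(n))) for j in range(p))
beta_hat = ista(X, y, lam)
start = objective(X, y, [0.0] * p, lam)
end = objective(X, y, beta_hat, lam)
print("objective:", start, "->", end)
print("nonzero coefficients:", sum(1 for b in beta_hat if b != 0.0), "of", p)
```

The step size is deliberately conservative so that the objective cannot increase from one iteration to the next; a tighter bound on the largest eigenvalue of X'X would converge faster.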
Big Data: Opportunities and Challenges
Big Data
A term for a collection of data so large and complex that it is difficult to process and analyze using on-hand database management tools, traditional data processing methods, and analysis methodologies.
[Figure: the data set D = {X1, X2, X3, …, Xn} is split into subsets D1, …, Dk, …, Dm; each subset yields an intermediate solution f1, f2, …, fm; the intermediate solutions are aggregated into the final estimation f*.]
The Big Data Bootstrap. Kleiner et al., ICML 2012.
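The following is a simplified sketch in the spirit of the cited bootstrap paper, as I read it: draw a few small subsets of size b = n^gamma, resample n points from each subset by storing counts rather than copies, assess the estimator on every resample, and average the assessments across subsets. It estimates the standard error of the sample mean. The function name, the parameter values, and the use of the mean as the estimator are assumptions for illustration, not details from the slides.

```python
import math
import random
import statistics
from collections import Counter

def blb_standard_error(data, num_subsets, num_resamples, gamma, rng):
    """Estimate the standard error of the sample mean from small subsets."""
    n = len(data)
    b = int(n ** gamma)  # subset size, much smaller than n
    per_subset = []
    for _ in range(num_subsets):
        subset = rng.sample(data, b)  # sub-sample without replacement
        estimates = []
        for _ in range(num_resamples):
            # Resample n points from b distinct values; keep counts, not n copies.
            counts = Counter(rng.choices(range(b), k=n))
            estimates.append(sum(subset[i] * c for i, c in counts.items()) / n)
        per_subset.append(statistics.stdev(estimates))
    # Aggregate the per-subset assessments by averaging.
    return statistics.fmean(per_subset)

rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(5000)]
se_blb = blb_standard_error(data, num_subsets=5, num_resamples=40, gamma=0.6, rng=rng)
se_reference = statistics.stdev(data) / math.sqrt(len(data))  # textbook formula
print("subset-based estimate:", se_blb)
print("reference s / sqrt(n):", se_reference)
```

Each resample touches only b distinct values, which is what makes the per-subset work cheap compared with bootstrapping the full data set.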
Challenge 1: Data Resource Management & Public Policies
Challenges 2 & 4: IT & Computation Science for Big Data; Big Data Engineering
Challenge 3: Statistics for Big Data Analytics
Problem 1: High Dimensionality
High dimensionality problem: The number of features (p) is far larger than the sample size (n), and n varies with p (n = n(p)).
Problem 2: Sub-sampling
Core open questions
• How to sub-sample/aggregate so that the final f* properly models D?
• Is distributed processing feasible?
• How well do traditional sub-sampling techniques work?
• Sub-sampling axioms (similarity; transitivity; …)
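One simple way to probe the first question is to fit the same model on random subsets and compare the aggregated result with a fit on all of D. The sketch below does this for a straight-line fit, assuming plain averaging as the aggregation rule; the synthetic data, the helper names, and the choice of m = 10 subsets are hypothetical and only illustrate the Map/Reduce pattern.

```python
import random

def fit_line(points):
    """Least-squares fit y = a + b * x on a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def map_step(data, m, rng):
    """Map: randomly partition D into m disjoint subsets D_1, ..., D_m."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[k::m] for k in range(m)]

def reduce_step(fits):
    """Reduce: aggregate intermediate solutions by averaging (an assumption)."""
    m = len(fits)
    return sum(a for a, _ in fits) / m, sum(b for _, b in fits) / m

rng = random.Random(0)
# Synthetic data set D: y = 1 + 2x plus small Gaussian noise.
data = []
for _ in range(10_000):
    x = rng.uniform(0.0, 1.0)
    data.append((x, 1.0 + 2.0 * x + rng.gauss(0.0, 0.1)))

subsets = map_step(data, m=10, rng=rng)
intermediate = [fit_line(d_k) for d_k in subsets]  # f_1, ..., f_m
f_star = reduce_step(intermediate)                 # final estimation f*
f_full = fit_line(data)                            # fit on all of D, for comparison
print("aggregated (intercept, slope):", f_star)
print("full-data (intercept, slope):", f_full)
```

Averaging works well here because the subsets are random and the estimator is nearly unbiased on each of them; for biased or nonlinear estimators the aggregation rule would need more care, which is the point of the open questions above.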