Spark Time Series Data Analysis


SELECT buyerid, saletime, qtysold,
       LAG(qtysold, 1) OVER (ORDER BY buyerid, saletime) AS prev_qtysold
FROM sales
WHERE buyerid = 3
ORDER BY buyerid, saletime;

Output:

 buyerid |      saletime       | qtysold | prev_qtysold
---------+---------------------+---------+--------------
       3 | 2008-01-16 01:06:09 |       1 |
       3 | 2008-01-28 02:10:01 |       1 |            1
       3 | 2008-03-12 10:39:53 |       1 |            1
       3 | 2008-03-13 02:56:07 |       1 |            1
       3 | 2008-03-29 08:21:39 |       2 |            1
       3 | 2008-04-27 02:39:01 |       1 |            2

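The same LAG computation can also be written against a Spark DataFrame with the built-in lag window function. A minimal sketch, not part of the original example; it assumes the same sales table is registered and mirrors the SQL above (including the unpartitioned window):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

// DataFrame over the sales table from the SQL example above
val sales = sqlContext.table("sales")

// Same window as the SQL: all rows ordered by buyerid, saletime
val w = Window.orderBy("buyerid", "saletime")

sales
  .where("buyerid = 3")
  .withColumn("prev_qtysold", lag("qtysold", 1).over(w))
  .orderBy("buyerid", "saletime")
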
Time       GOOG   AAPL   YHOO   MSFT   ORCL
5:00 PM    $523   $384   $40    $134   $23
6:00 PM           $384   $60    $138   $30
7:00 PM    $524   $385          $175   $35
8:00 PM    $600   $385          $178   $45
9:00 PM    $574   $378   $70    $123   $38
10:00 PM   $400   $345   $80    $184

// spark-ts: a uniform DateTimeIndex over every 2 business days
val dtIndex = DateTimeIndex.uniform(
  ZonedDateTime.parse("2015-4-10", ISO_DATE),
  ZonedDateTime.parse("2015-5-14", ISO_DATE),
  2 businessDays) // wowza that’s some syntactic sugar

// Look up the date-time at a given position in the index
dtIndex.dateTimeAtLoc(5)

val tsRdd: TimeSeriesRDD = ...

// Find a sub-slice between two dates
val subslice = tsRdd.slice(
  ZonedDateTime.parse("2015-4-10", ISO_DATE),
  ZonedDateTime.parse("2015-4-14", ISO_DATE))

// Fill in missing values based on linear interpolation
val filled = subslice.fill("linear")

// Use an AR(1) model to remove serial correlations
val residuals = filled.mapSeries(series =>
  ar(series, 1).removeTimeDependentEffects(series))

ARIMA:
val modelsRdd = tsRdd.map { ts =>
  ARIMA.autoFit(ts)
}

GARCH:
val modelsRdd = tsRdd.map { ts =>
  GARCH.fitModel(ts)
}

// Rank series by the Durbin-Watson test statistic for serial correlation
val mostAutocorrelated = tsRdd.map { ts =>
  (TimeSeriesStatisticalTests.dwtest(ts), ts)
}.max

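The fitted per-series models can then be used to project each series forward. A minimal sketch in the same style as the snippets above; it assumes spark-ts's ARIMAModel.forecast(observed, nFuture) method, and the 10-step horizon is purely illustrative:

// Sketch: fit and forecast each series (assumes ARIMAModel.forecast exists;
// see the spark-ts docs for the exact layout of the returned vector)
val forecastsRdd = tsRdd.map { ts =>
  val model = ARIMA.autoFit(ts)
  model.forecast(ts, 10) // extend the series by 10 forecasted points
}
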
Time            4/10/1990 23:54:12   4/10/1990 23:54:13   4/10/1990 23:54:14   4/10/1990 23:54:15   4/10/1990 23:54:16
Something       4.5                  5.5                  6.6                  7.8                  3.3
Something Else  100.4                101.3                450.2                600                  2000

% MATLAB: generating business-day dates with busdays
vec = datestr(busdays('1/2/01','1/9/01','weekly'))

vec =
05-Jan-2001
12-Jan-2001

“Observations”

Timestamp    Key   Value
2015-04-10   A     2.0
2015-04-11   A     3.0
2015-04-10   B     4.5
2015-04-11   B     1.5
2015-04-10   C     6.0

“Instants”

Timestamp    A     B     C
2015-04-10   2.0   4.5   6.0
2015-04-11   3.0   1.5   NaN

# PySpark window functions: for each product, the gap between its revenue and
# the best-selling product in the same category
import sys
from pyspark.sql import functions as func
from pyspark.sql.window import Window

dataFrame = sqlContext.table("productRevenue")

# Window covering every row in the product's category, ordered by revenue
windowSpec = \
    Window \
        .partitionBy(dataFrame['category']) \
        .orderBy(dataFrame['revenue'].desc()) \
        .rangeBetween(-sys.maxsize, sys.maxsize)

revenue_difference = \
    (func.max(dataFrame['revenue']).over(windowSpec) - dataFrame['revenue'])

dataFrame.select(
    dataFrame['product'],
    dataFrame['category'],
    dataFrame['revenue'],
    revenue_difference.alias("revenue_difference"))

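The same window specification can be expressed with Spark's Scala DataFrame API. A minimal sketch assuming the same productRevenue table is available; it is an illustration, not part of the original example:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

val dataFrame = sqlContext.table("productRevenue")

// All rows in the same category, ordered by revenue descending
val windowSpec = Window
  .partitionBy(dataFrame("category"))
  .orderBy(dataFrame("revenue").desc)
  .rangeBetween(Long.MinValue, Long.MaxValue)

val revenueDifference =
  max(dataFrame("revenue")).over(windowSpec) - dataFrame("revenue")

dataFrame.select(
  dataFrame("product"),
  dataFrame("category"),
  dataFrame("revenue"),
  revenueDifference.alias("revenue_difference"))
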
Time         4/10/1990 23:54:12   4/10/1990 23:54:13   4/10/1990 23:54:14   4/10/1990 23:54:15   4/10/1990 23:54:16
Observation  4.5                  5.5                  6.6                  7.8                  3.3

“Time Series”

DateTimeIndex: [2015-04-10, 2015-04-11]

Key   Series
A     [2.0, 3.0]
B     [4.5, 1.5]
C     [6.0, NaN]

A TimeSeriesRDD pairs one shared time index with a vector of values per key:

index: DateTimeIndex
rdd: RDD[(String, Vector[Double])]

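To go from the long “observations” layout above to the keyed “time series” layout, spark-ts provides a loader over a DataFrame of (timestamp, key, value) rows. A minimal sketch, assuming spark-ts's TimeSeriesRDD.timeSeriesRDDFromObservations and DayFrequency as I recall them; the DataFrame name observationsDf and its column names are placeholders:

// Sketch: build a TimeSeriesRDD from "observations"-shaped rows.
import java.time.{LocalDateTime, ZonedDateTime, ZoneId}
import com.cloudera.sparkts.{DateTimeIndex, DayFrequency, TimeSeriesRDD}

// Daily index spanning the two dates shown above
val dtIndex = DateTimeIndex.uniformFromInterval(
  ZonedDateTime.of(LocalDateTime.parse("2015-04-10T00:00:00"), ZoneId.of("Z")),
  ZonedDateTime.of(LocalDateTime.parse("2015-04-11T00:00:00"), ZoneId.of("Z")),
  new DayFrequency(1))

// observationsDf: one row per (timestamp, key, value) observation,
// with a TimestampType "timestamp" column
val tsRdd = TimeSeriesRDD.timeSeriesRDDFromObservations(
  dtIndex, observationsDf, "timestamp", "key", "value")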