R-Script for Mainframe Metric Analysis (Regression\ARIMA) Review
- From: Idgarad <idgarad@xxxxxxxxx>
- Date: Thu, 5 Jun 2008 10:16:33 -0700 (PDT)
I could use some suggestions on a script I am working on for some
basic forecasting of mainframe usage. I am having trouble
understanding and figuring out how to handle additive outliers and
innovative outliers as well as pulse detection and handling (transfer
functions.)
I am attempting to write a rather automatic forecasting tool to help
out some folks with understanding mainframe activity and want to make
it open an usable to others that, like myself have little training in
statistics to make some basic forecasts with some sound statistics
behind it. (Doing this as a learning experience.)
Any comments, criticisms, and suggestions welcome.
Id.
############################
# Mainframe MIPS Analysis
# --------------------------
# Primary Author: Idgarad
#
#
# Known Issues & To Do
# Still Vulnerable to Outliers
# AO and IO handling
# Pulse Detection
############################
############################
# Initial Library Loads
############################
library(RODBC)
library(forecast)
library("geneplotter")
library(forecast)
library(fUtilities)
library(TSA)
require(gplots)
############################
#Define Graph Palette
############################
mypalette=c()
mypalette$background="#FFFFFF"
mypalette$chart="#FFFFFF"
mypalette$forecastRegion="#66CCFF"
mypalette$confidence="#FF9966"
mypalette$limits="#FF0000"
mypalette$major="#000000"
mypalette$minor="#cccccc"
mypalette$actual="#aaaaaa"
mypalette$dp1="#9900FF"
mypalette$dp2="#EEEE00"
mypalette$dp3="#CCFF00"
mypalette$dp4="#00CCFF"
mypalette$dp5="#FF00CC"
############################
# Using the Forecasting Process as defined in
# "Introduction to Time Series Analysis and Forecasting"
# P. 12, Montgomery, Jennings, Kulahci
# ISBN:978-0-471-65397-4
#
# 1)Problem Definition
# 2)Data Collection
# 3)Data Analysis
# 4)Model Selection and Fitting
# 5)Model Validation
# 6)Forecasting Model Deployment
# 7)Monitor Forecasting Model Performance
############################
############################
# Problem Definition
#
# The Mainframe testing environment MIPS usage is a strong indicator
of potential
# performance issues. The ability to forecast MIPS usage can help us
find periods in
# which a test environment may have reduced turn-around times on jobs
or batches.
#
# In addition this allows us to then plan for conversion periods when
MIPS are
# abundant based on the existing enterprise calendar of activities.
#
#
# The mainframe environment is broken into several environments
(LPARS)
# (ABCD)UNIT - Development
# (KBLE)IT - Integration
# (U000)UAT - Acceptance
# (UB40)UAT - Acceptance Carry Over
# (MNOP)TRA - Training
# (PAPA)PROD - Production
# (All but PAPA are considered Test Environments)
#
# Each environment is allocated a given number of MIPS based on
weights. Test environments
# can borrow unsed MIPS from PAPA, limited only be the number of
unused MIPS and how many
# engines are allocated to the test environment.
#
# The objective is to model and forecast the usage patterns for each
test environment as
# accurately as possible. Those forecasts are then used in a
spread*** for planning purposes.
# This entire process and analysis must be nearly 100% automated and
run on a scheduled basis
# as there are no resources with a statistical degree to maintain this
on a regular basis.
############################
############################
# Data Collection
#
# The data is collected weekly in the
# form of the actual used MIPS. Activities
# are coordinated on a quarterly basis with
# variations in each quarterly release. Due
# to those variations the assumed seasonal
# period is treated weekly. Due to how the
# calendar functions though, certain activities
# can shift a week back or forth as a result of
# holidays bumping certain activities.
#
# MIP usage is largely driven by batch activity
# rather then actual people in the evening and
# largely driven by people during the day. As we
# get weekly data we have to model holidays to
# factor the influence on MIPS driven by people.
#
# To ensure that this script is reuseble we want
# to handle holidays automatically and in a configurable
# manner so the script can localize approproiately.
############################
############################
# Data Analysis
#
# Data in the script will be loaded in from
# Excel sheets generated from mainframe reports.
#
# As Part of the data analysis phase we need to
# to present the following to meet some basic
# statistical reporting requirements:
#
# Deliverables:
# 1)The Raw Data Plotted (Time Series Plot)
# 2)Identification of Outliers in the Original Data (QQ Plots, Box
Plots)
# 3)Standard decompositions of the Data (STL plot, QQ, Histograms)
# 4)The List of Regressors Available (Table Output)
# a)This includes generating regressors to account for outliers and
holidays
# 5)The Analysis and selection of Regressors to be used in the model
############################
############################
# Model Selection
#
# We know that MIP usage is determined by project work and
# scheduled activities. As such the calendar of events directs
# system availability and reflects the standard project and
# system lifecycle. The calls for regression of those calendar
# entries. We also know that the work done is seasonal in nature.
# (Quarterly Releases) as well as human productivity.
#
#
# A significant challenge though is there will be level shifts that
occur
# as a result of new policies that come into effect. Level shift will
need
# to be addressed (The policy didn't exist in the past but does
starting at X and going forward.
#
# As suggested by Rob Hyndman:
#(Professor of Statistics, Monash University;Editor-in-Chief,
International Journal of Forecasting)
#<quote>
# ... the simplest approach would be to use lm to fit a standard
regression
# without any autocorrelated errors. Then use auto.arima to select
the model
# for the residuals from that regression. For example:
#
# fit <- lm(y ~ x + z) #where x and z are the regressors, seasonal
dummies, etc.
# res <- residuals(fit)
# fit.arma <- auto.arima(res)
# final.fit <- Arima(...)
#
# The final line would piece together the various parts of the model.
#</quote>
#
#
# Using that suggestion as our basis at this time we will look at 5
possible models
# m1:lm() using all the regressors
# m2:lm() using all the regressors less the intercept
# m3:lm() using select regressors (automatic step elimination)
# m4:lm() using select regressors (automatic step elimination) less
the incercept
# m5:rlm()using all the regressors; this is purely a test and could be
# substituted for a hand-tuned regressor
#
# Given the automated requirement, model revisions and IO (innovative
outliers)
# need to be handled automatically. Periodic model changes are likely
with a review
# of models done quarterly. (In short, this script gets run once each
quarter)
#
# Thus the model should be determined using 3/4th of the available,
current data at
# runtime, validated against the remaining 1/4th of actual data, and
produce
# a forecast of the next 52 weeks for planning purposes.
#
#
# Deliverables:
# 1)Model Definitions
# 2)Model Accuracy Components (AIC,MAPE,etc.)
# 3)Graph comparing models (TS Plot with fitted models)
############################
############################
# Model Validation
#
# The model will be validated using in-sample testing and out-of-
sample testing
# as described above. The validity of the model as a whole will be
detemined by
# observation over the course of 1 year from the completion of the
script
# and the automated execution of the script 4 times during that year
of observation.
#
#
# Deliverables:
# 1)Utilization Graph with Actual versus Forecast (Excel) Updated
weekly, revised quarterly
# 2)Script outputs an file to be imported into Excel.
############################
############################
# Forecasting Model Deployment
#
# The forecasts will be used as part of a quarterly metrics report
showing the next
# 52 weeks of planned usage and available reserves.
#
# Deliverables:
# 1)Utilization Graphs with expected reserves
# 2)Quarterly Report indicating MIPS forecasts
# 3)as well as batch turnaround reports (out of scope of this script)
#############################
############################
# Monitoring Forecasting Model Performance
#
# To ensure that the model is accurate this script will be run
quarterly to see
# if any model changes are needed. Successful accuracy should be
determined that
# no forecast should off by more then 2 standard deviations for that
quarterly time
# period. Reserves will be considered adaquate if they can meet the
forecast +/- 2
# standard deviations.
#
#
# Deliverable:
# Control Charts
############################
# Really should make this a REALLY big loop or perhaps a function....
#Raw Data
channel1 <- odbcConnectExcel("D:/RSTATS/metrics.xls")
sqlTables(channel1)
sh1 <- sqlFetch(channel1, "Actuals$")
#Table of dummy regressors date,r1,r2,r3,etc....
channel2 <- odbcConnectExcel("D:/RSTATS/events.xls")
sqlTables(channel2)
sh2 <- sqlFetch(channel2, "data$")
#(TODO) Prep Out-of-sample test range
#modLength=length(sh1$U000[,1])
# Get 3/4ths
#modMax=round((modlength/4)*3)
#Ok U000 data to a Time Series
tsU000=ts(sh1$U000,start=c(2004,1),freq=52)
summary(tsU000)
# This Heavy Voodoo™ allows us to have a dynamic number of
#dummy variables we can add\remove from the spread***
forecastDistance <- 52
#Grab Existing Regressors (clipping out the date)
cReg <- sh2[1:length(tsU000),-1]
#Grab X Future Regressors equal to the forecastDistance (gotta double
check if I need a +1 on the start point)
fReg <- sh2[length(tsU000):(length(tsU000)+forecastDistance),-1]
#end heavy vodoo
#at this point we should add holiday dummy variables to cReg and fReg
# no known way found yet
linearModel1=lm(tsU000 ~ .,cReg)
linearModel2=step(linearModel1)
linearModel3=lm(tsU000 ~ .-1,cReg)
linearModel4=step(linearModel3)
linearModel5=rlm(tsU000 ~ .,cReg)
LinearModel1.res <- residuals(linearModel1)
LinearModel2.res <- residuals(linearModel2)
LinearModel3.res <- residuals(linearModel3)
LinearModel4.res <- residuals(linearModel4)
LinearModel5.res <- residuals(linearModel5)
arma1.fit <- auto.arima(LinearModel1.res,frequency(52))
arma2.fit <- auto.arima(LinearModel2.res,frequency(52))
arma3.fit <- auto.arima(LinearModel3.res,frequency(52))
arma4.fit <- auto.arima(LinearModel4.res,frequency(52))
arma5.fit <- auto.arima(LinearModel5.res,frequency(52))
#AO's can be added to the cReg as a normal dummy variable
# but these are AOs from the model not the original data.
# is it better to handle AOs from the original data? Are they the
same?
#move up prior to lm section and append to cReg once we find out...
arma1.ao <- detectAO(arma1.fit)
arma2.ao <- detectAO(arma2.fit)
arma3.ao <- detectAO(arma3.fit)
arma4.ao <- detectAO(arma4.fit)
arma5.ao <- detectAO(arma5.fit)
#What that hell do I do with an innovative outlier? Transfer function
or what?
#auto.arima doesn't handle the IO=c(...) stuff.... Umm...
#transfer functions, etc. are a deficency....
arma1.io <- detectIO(arma1.fit)
arma2.io <- detectIO(arma2.fit)
arma3.io <- detectIO(arma3.fit)
arma4.io <- detectIO(arma4.fit)
arma5.io <- detectIO(arma5.fit)
#Sample on how to auto-grab regressors from DetectAO and DetectIO and
#appened them to our regression array. You'd have to do this for each
model
#as the residuals are where the outliers are coming from and diff
models
#would have different residuals left over. IO is best left to arimax
functions
#directly. I assume at this point that AO's can be added to Regression
tables
#if that is the case then REM out the IO lines and pass the detectIO
results
#into the arimax(x,y,z,IO=detectIO(blah))
#
# Need a better understanding of how to address the AO and IO's in
this script before implementing them
# (Repeat for each model, cReg1,cReg2//arma1.ao,arma2.ao; etc..)
# the print is in there for my own sanity....
#cReg1=cReg
#fReg1=fReg
#for(i in arma1.io$ind){ print(i);cReg1[,paste(sep="
","IO",i)]=1*(seq(cReg1[,2])==i)}
#for(i in arma1.ao$ind){ print(i);cReg1[,paste(sep="
","AO",i)]=1*(seq(cReg1[,2])==i)}
#for(i in arma1.io$ind){ print(i);fReg1[,paste(sep="
","IO",i)]=1*(seq(fReg1[,2]))}
#for(i in arma1.ao$ind){ print(i);fReg1[,paste(sep="
","AO",i)]=1*(seq(fReg1[,2]))}
#Get the pdq,PDQs into a variable so we can re-feed it if neccessary
arma1.fit$order=c(arma1.fit$arma[1],arma1.fit$arma[2],arma1.fit
$arma[6])
arma2.fit$order=c(arma2.fit$arma[1],arma2.fit$arma[2],arma2.fit
$arma[6])
arma3.fit$order=c(arma3.fit$arma[1],arma3.fit$arma[2],arma3.fit
$arma[6])
arma4.fit$order=c(arma4.fit$arma[1],arma4.fit$arma[2],arma4.fit
$arma[6])
arma5.fit$order=c(arma5.fit$arma[1],arma5.fit$arma[2],arma5.fit
$arma[6])
arma1.fit$seasonal=c(arma1.fit$arma[3],arma1.fit$arma[4],arma1.fit
$arma[7])
arma2.fit$seasonal=c(arma2.fit$arma[3],arma2.fit$arma[4],arma2.fit
$arma[7])
arma3.fit$seasonal=c(arma3.fit$arma[3],arma3.fit$arma[4],arma3.fit
$arma[7])
arma4.fit$seasonal=c(arma4.fit$arma[3],arma4.fit$arma[4],arma4.fit
$arma[7])
arma5.fit$seasonal=c(arma5.fit$arma[3],arma5.fit$arma[4],arma5.fit
$arma[7])
#these Two are used for linearModel2 and linearModel4, Get only the
#regressors that surived step removal.
newcReg=cReg[match(names(linearModel2$coeff[-1]),names(cReg))]
newfReg=fReg[match(names(linearModel2$coeff[-1]),names(fReg))]
#Scenario 1 - All Regressors Left In
newFit1.a <- Arima(tsU000,order=arma1.fit
$order,seasonal=list(order=arma1.fit
$seasonal),xreg=cReg,include.drift=T)
newFit1.b <- Arima(tsU000,order=arma1.fit
$order,seasonal=list(order=arma1.fit
$seasonal),xreg=cReg,include.drift=F)
#Scenario 2 - Step Removal of Regressors
newFit2.a <- Arima(tsU000,order=arma2.fit
$order,seasonal=list(order=arma2.fit
$seasonal),xreg=newcReg,include.drift=T)
newFit2.b <- Arima(tsU000,order=arma2.fit
$order,seasonal=list(order=arma2.fit
$seasonal),xreg=newcReg,include.drift=F)
#Scenario 3 - All Regressors Left In with Intercept Removed
newFit3.a <- Arima(tsU000,order=arma3.fit
$order,seasonal=list(order=arma3.fit
$seasonal),xreg=cReg,include.drift=T)
newFit3.b <- Arima(tsU000,order=arma3.fit
$order,seasonal=list(order=arma3.fit
$seasonal),xreg=cReg,include.drift=F)
#Scenario 4 - Step Removal of Regressors with Intercept Removed (I
have a feeling this is identical to #2 in results
newFit4.a <- Arima(tsU000,order=arma4.fit
$order,seasonal=list(order=arma4.fit
$seasonal),xreg=newcReg,include.drift=T)
newFit4.b <- Arima(tsU000,order=arma4.fit
$order,seasonal=list(order=arma4.fit
$seasonal),xreg=newcReg,include.drift=F)
#Scenario 5 - Robust1, For giggles and grins for now
newFit5.a <- Arima(tsU000,order=arma5.fit
$order,seasonal=list(order=arma5.fit
$seasonal),xreg=newcReg,include.drift=T)
newFit5.b <- Arima(tsU000,order=arma5.fit
$order,seasonal=list(order=arma5.fit
$seasonal),xreg=newcReg,include.drift=F)
#All the drift terms ones are broke as the drift term doesn't match up
to the # of columns in fReg
#Still don't know how to fix at this time.
forecast1.a <-predict(newFit1.a,n.ahead=53,newxreg=fReg)
forecast1.b <-predict(newFit1.b,n.ahead=53,newxreg=fReg)
forecast2.a <-predict(newFit2.a,n.ahead=53,newxreg=newfReg)
forecast2.b <-predict(newFit2.b,n.ahead=53,newxreg=newfReg)
forecast3.a <-predict(newFit3.a,n.ahead=53,newxreg=fReg)
forecast3.b <-predict(newFit3.b,n.ahead=53,newxreg=fReg)
forecast4.a <-predict(newFit4.a,n.ahead=53,newxreg=newfReg)
forecast4.b <-predict(newFit4.b,n.ahead=53,newxreg=newfReg)
forecast5.a <-predict(newFit5.a,n.ahead=53,newxreg=newfReg)
forecast5.b <-predict(newFit5.b,n.ahead=53,newxreg=newfReg)
#Diagnostic Plot for model selection
#everything after this point needs to be adjusted based on the model
chosen
plot(tsU000,col="grey")
lines(fitted(newFit1.b),col=mypalette$dp1)
lines(fitted(newFit2.b),col=mypalette$dp2)
lines(fitted(newFit3.b),col=mypalette$dp3)
lines(fitted(newFit4.b),col=mypalette$dp4)
lines(fitted(newFit5.b),col=mypalette$dp5)
# Script Deliverables
write.table(forecast1.b,file="D:/RSTATS/U000/
U000Forecast.csv",sep=",")
#Two Model Choices, their fits and the selected model's forecast with
CI
#Based off a chart on RGallery
#todo (all charts in golden ratio vs. square)
png("D:/RSTATS/U000/images/Time Series U000 Forecast and Fit.png"
ylim <- c( min(tsU000,forecast1.b$pred - 1.96 * forecast1.b$se),
max(tsU000,forecast1.b$pred + 1.96 * forecast1.b$se))
originalMaxDate = max(time(tsU000))
originalMinDate = min(time(tsU000))
f1MaxDate = max(time(forecast1.b$pred))
f1MinDate = min(time(forecast1.b$pred))
xlim <- c( originalMinDate,f1MaxDate)
opar <- par(mar=c(4,4,2,2),las=1)
plot(tsU000,ylim=ylim,type="n",xlim=xlim)
usr <-par("usr")
rect(usr[1],usr[3],f1MinDate, usr[4],border=NA,col=mypalette$chart)
rect(f1MinDate,usr[3],usr[2],usr[4],border=NA,col=mypalette
$forecastRegion)
abline(h= seq(from=0,to=(ylim[2]+100),by=10), col=mypalette$minor,
lty=3)
abline(h= seq(from=0,to=(ylim[2]+100),by=100), col=mypalette$major,
lty=3)
polygon( c(time(forecast1.b$pred),rev(time(forecast1.b$pred))),
c(forecast1.b$pred - 1.96*forecast1.b$se,rev(forecast1.b$pred
+ 1.96*forecast1.b$se)),
col = mypalette$confidence,
lty=2,border=NA)
lines(forecast1.b$pred - 1.96*forecast1.b$se,lty=2, col=mypalette
$limit)
lines(forecast1.b$pred + 1.96*forecast1.b$se,lty=2, col=mypalette
$limit)
lines(tsU000,lwd=1,col=mypalette$actual,pch=20)
lines(forecast1.b$pred,col=mypalette$forecast)
#Which ever the selected model is make a solid line
lines(fitted(newFit1.b),col=mypalette$dp1,pch=20,type="p")
lines(fitted(newFit2.b),col=mypalette$dp2,lty="dashed")
#lines(fitted(newFit3.b),col=mypalette$dp3,lty="dashed")
#lines(fitted(newFit4.b),col=mypalette$dp4,lty="dashed")
#lines(fitted(newFit5.b),col=mypalette$dp5,lty="dashed")
smartlegend( x="left", y= "top", inset=0,
#smartlegend parameters
legend = c("series","95% CI","forecast"
,"model 1"
,"model 2"
#,"model 3"
#,"model 4"
#,"model 5"
),
fill=c(mypalette$actual,mypalette$confidence,mypalette$forecast
,mypalette$dp1
,mypalette$dp2
#,mypalette$dp3
#,mypalette$dp4
#,mypalette$dp5
),bg = mypalette$background)
box()
par(opar)
dev.off()
# Time Series Decomposition
png("D:/RSTATS/U000/images/Time Series U000 Base MIPS.png")
plot(stl(tsU000,s.window="periodic"),main="Time Series Base MIPS
Recorded")
dev.off()
# I don't even remember why this is here....
k= c(.5,1,1,1,.5)
(k = k/sum(k))
ftsU000=filter(tsU000,sides=2, k)
#Lowess Fit
png("D:/RSTATS/U000/images/Time Series U000 Lowess.png")
plot(tsU000)
lines(ftsU000, col="red",lty="dashed")
lines(lowess(tsU000),col="blue", lty="dashed")
dev.off()
#Why on Earth did I put this in here?
shapiro.test(diff(log(tsU000)))
#--- Histogram and Q-Q Plot
png("D:/RSTATS/U000/images/Time Series U000 Histogram and QQ.png")
par(mfrow=c(2,2))
hist(tsU000,prob=T,12,col=mypalette$dp3)
lines(density(tsU000,na.rm=T),col=mypalette$dp1,lwd=2)
qqnorm(tsU000)
qqline(tsU000)
hist(diff(log(tsU000)),prob=T,12,col=mypalette$dp3)
lines(density(diff(log(tsU000)),na.rm=T),col=mypalette$dp1,lwd=2)
qqnorm(diff(log(tsU000)))
qqline(diff(log(tsU000)))
dev.off()
png("D:/RSTATS/U000/images/Time Series U000 Lag Plot.png")
lag.plot(tsU000,18,do.lines=F)
dev.off()
png("D:/RSTATS/U000/images/Time Series U000 DIFF-LOG ACF and
PACF.png")
par(mfrow=c(2,1))
acf(diff(log(tsU000)),52)
pacf(diff(log(tsU000)),52)
dev.off()
png("D:/RSTATS/U000/images/Time Series U000 ACF and PACF.png")
par(mfrow=c(2,1))
acf(tsU000,52)
pacf(tsU000,52)
dev.off()
png("D:/RSTATS/U000/images/LOG Time Series U000 Decompositions.png")
plot(strdecom<-stl(log(tsU000),"per"))
dev.off()
################### END SCRIPT #########################
Am I hitting the necessary points for a decent forecast methodology?
Any suggestions or areas that need improvement?
.
- Follow-Ups:
- Next by Date: Re: What is power and size of hypothesis test?
- Next by thread: Re: R-Script for Mainframe Metric Analysis (Regression\ARIMA) Review
- Index(es):
Loading