These are tips focused on enterprise-level organisations wanting to implement Microsoft’s Azure Machine Learning studio. But I’m sure the content translates well to other settings.

Effective analytics and machine learning require coordination of **People, Process & Platform**. But do you need all three?

Yes. But you don’t need all three to get started.

Most Data Scientists will not care about MLOps and will not identify as owning it. They want to lead experimentation and innovation, not build production-grade ML pipelines. You will either have to:

- Find an internal champion to lead this and train others.
- Hire an ML engineer if you have enough work to keep them busy (A recent ML Engineer job ad I saw was around $650k, so bear that in mind).

- Work with a specialist partner / consultant to fill this gap and support your teams doing what they are good at.

Converting experimental analytics POCs into robust MLOps-style pipelines is not trivial. It takes real, behind-the-scenes work, which is kind of thankless. But sometimes you need to eat your vegetables and do this. Ensuring you have a well-crafted process for governing how, when and why MLOps should exist will help secure the time, funding and social capital you need.

Want to know where you are going to face challenges and delays implementing an AI/ML platform?

Wonder no more: Network Security.

Azure ML Studio (and any ML platform) will require serious planning around network architecture and security, and for good reason. It’s not as scary as it sounds, but it does require expertise. You need to have strong representation on your project from a decision maker and a ‘doer’ in network security. If these people aren’t inside the tent with you, your project will grind to a halt.

If we are being honest, not many organisations are ready for scalable, reproducible, high performance ML deployments. But almost all orgs are building toward this in the next 5 years. So what’s most useful in the here and now?

- Easy to access online notebooks with various kernels
- Click of a button scalable compute, without fussing about with the infra layer
- AutoML tools to find performance ceilings and explore the solution space of new problems

Want to get your data scientists on board? Tell them they won’t have to run jobs on their laptops overnight anymore.

There is a lot to complain about if you are an R user. Azure ML studio is decidedly a Python-oriented tool. While, admittedly, most high-level AI development work is in Python, in reality a lot of so-called “ML” projects are classical statistical models anyway. There are nice tools within the R ecosystem for putting R in prod. However, if you work in an enterprise setting, you may need to deploy your models on the chosen enterprise analytics platform. This is not ideal for R users, but hopefully in a future post I can show you how to have the best of both worlds.

Like any tool, it’s only good for what it’s good for. When we think of true end-to-end machine learning projects, there are steps in that process that are a better fit for other tools. A common suspect here is data import and processing. It’s worth coordinating ML deployments with your data engineering teams to ensure you are using the right tools for the right jobs.

If you are wanting to take the next steps in setting up people, process and platforms for high performance machine learning, get in touch for a chat.

The consequences of poorly built statistical models are not trivial. In 2015, Amazon realized its ‘AI recruiting tool’ didn’t like women^{1}. In the US, facial recognition models used by police were found to have biases that more commonly misidentified underrepresented communities.^{2}

So what can you do about it?

- What training data was used, and how was this collected?
- Could certain sub-populations be over or under-represented in this data?
- Have customer details been anonymised?

- Are sensitive features used in the model training (race, religion, gender, political preferences)?

Algorithmic unfairness and biases are already huge issues which are only getting worse. Further reading on this topic is covered by Cathy O’Neil’s excellent book *Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*^{3}.

All statistical models have assumptions that are made in order for the math to work out. Models are intended to be simplified representations of reality, deliberately so statisticians can exploit properties like the Central Limit Theorem to help make inferences. For example the assumptions behind a simple linear regression include:

- The response variable can be expressed as a linear combination of the predictors.

- The variance of the error terms is homoscedastic (has constant variance).

- The errors in the response are independent.

In many cases these can be checked using standard diagnostic tests and plots. Often, though, it also requires more in-depth domain knowledge and context.
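As a rough illustration of what those diagnostics can look like, here is a minimal base R sketch. The data are simulated (so the assumptions hold by construction) and every variable name is made up for the example; it is not a substitute for looking at the plots yourself.

```r
# Simulated data that satisfies the assumptions listed above (illustrative only).
set.seed(42)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)  # linear mean, constant variance, independent errors

fit <- lm(y ~ x)
res <- residuals(fit)

# Linearity: the fitted slope should sit close to the true value of 0.5.
slope <- unname(coef(fit)["x"])

# Constant variance: the size of the residuals should not grow with the fitted values.
spread_cor <- cor(abs(res), fitted(fit))

# Independence: the lag-1 autocorrelation of the residuals should be near zero.
lag1_acf <- acf(res, plot = FALSE)$acf[2]

# In an interactive session, plot(fit) draws the standard diagnostic plots
# (residuals vs fitted, normal Q-Q, scale-location, leverage).
c(slope = slope, spread_cor = spread_cor, lag1_acf = lag1_acf)
```

None of these numbers tells you whether the model makes sense for the problem; that part still needs domain knowledge.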

Ever been told a model is 99% accurate? I’d be very worried if you had. Be skeptical of very high performance.

xkcd highlights this well with its ‘Is it Christmas?’ predictive model.

On what data has the model been tested? Is it a completely independent test set? Was there any leakage from the training data? Was the feature engineering done before or after the training/test split? (Hint: it usually needs to be after.)
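To make the leakage point concrete, here is a small base R sketch of scaling a feature without peeking at the test set. The data frame, column names and 80/20 split are all assumptions made up for the example.

```r
set.seed(1)
# Hypothetical data: one numeric feature and a binary outcome.
dat <- data.frame(x = rnorm(100, mean = 50, sd = 10),
                  y = rbinom(100, 1, 0.3))

# 1. Split first.
train_idx <- sample(nrow(dat), size = 80)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# 2. Learn the scaling parameters from the training rows only.
mu    <- mean(train$x)
sigma <- sd(train$x)

# 3. Apply those same training parameters to both sets.
train$x_scaled <- (train$x - mu) / sigma
test$x_scaled  <- (test$x - mu) / sigma
```

If `mu` and `sigma` were computed on the full data set before splitting, information from the test rows would leak into training, and the test results would look better than they should.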

If your model is a binary classification model, you should know what the ‘null model’ is and whether it outperforms this.

*Accuracy* is just one measure: the proportion of correct classifications (both positive and negative class). But you may have a different tolerance for misclassifications of each class. For example, you might be predicting the presence of a disease from a test. If you make a false positive, will the risk of side-effects or cost outweigh the risk of making a false negative and having the patient get sicker or die? It’s tricky. In addition to accuracy, ask for the following metrics, along with an explanation of what they mean.

- Sensitivity / Recall
- Specificity
- Precision / Positive Predictive Value
- Negative Predictive Value
- ROC AUC

Ask for a confusion matrix. All of the above measures (except ROC AUC) can be calculated from the confusion matrix. Despite its name, it will make it instantly clear where the model is working well and where it is not.
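Here is a small base R sketch of how those metrics, and the ‘null model’ accuracy, fall out of a confusion matrix. The counts are invented purely for illustration.

```r
# Hypothetical counts for a binary classifier (illustrative numbers only).
tp <- 40; fn <- 10    # 50 actual positives
fp <- 30; tn <- 920   # 950 actual negatives
total <- tp + fn + fp + tn

conf_mat <- matrix(c(tp, fn, fp, tn), nrow = 2, byrow = TRUE,
                   dimnames = list(actual    = c("positive", "negative"),
                                   predicted = c("positive", "negative")))

accuracy    <- (tp + tn) / total
sensitivity <- tp / (tp + fn)   # recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # positive predictive value
npv         <- tn / (tn + fn)   # negative predictive value

# Null model: always predict the most common class.
null_accuracy <- max(tp + fn, fp + tn) / total

round(c(accuracy = accuracy, null_accuracy = null_accuracy,
        sensitivity = sensitivity, specificity = specificity,
        precision = precision, npv = npv), 3)
```

In this made-up case the model is 96% accurate, which sounds great until you notice that always predicting ‘negative’ gets 95%, and that fewer than 6 in 10 of its positive calls are correct.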

Ask what cut-off or threshold was used to make the predicted classifications. Many classification models return conditional class probabilities, which need to be converted into labels such as (Yes/No, Cancer/Not Cancer, Churn/Not Churn). A default value is to use a 0.5 probability for the crisp cutoff, but it’s subjective and depends on the desired trade off in sensitivity / specificity as well as other complicated factors like training class imbalance.
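A quick way to see the trade-off is to score the same predicted probabilities at a few thresholds. This is a sketch on simulated data; the probabilities, class balance and threshold values are all assumptions for the example.

```r
set.seed(7)
actual <- rbinom(500, 1, 0.2)  # roughly 20% positive class

# Hypothetical predicted probabilities, higher on average for true positives.
prob <- plogis(rnorm(500, mean = ifelse(actual == 1, 0.5, -1.5)))

metrics_at <- function(threshold) {
  pred <- as.integer(prob >= threshold)
  c(threshold   = threshold,
    sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(pred == 0 & actual == 0) / sum(actual == 0))
}

# One row per threshold.
results <- t(sapply(c(0.2, 0.35, 0.5), metrics_at))
round(results, 3)
```

Lowering the threshold catches more true positives at the cost of more false alarms, so the ‘right’ cutoff depends on which mistake is more expensive.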

If you don’t get good answers to these questions, you should probably give me a call.

Have I missed something? Let me know!

1. https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G

2. https://www.nytimes.com/2019/07/08/us/detroit-facial-recognition-cameras.html

3. https://www.goodreads.com/book/show/28186015-weapons-of-math-destruction

The most undervalued skill in delivering value with data science teams is picking projects that are likely to succeed. There is no shortcut - it takes years of hard earned experience.

A number that seems to be floating around is 80% of data science projects will FAIL. Ouch.

Many of these types of numbers are ‘predictions’ from consultancies who stand to benefit from making big claims.

“Through 2022, only 20% of analytic insights will deliver business outcomes.”

https://blogs.gartner.com/andrew_white/2019/01/03/our-top-data-and-analytics-predicts-for-2019/

Commonly cited ways to fix this include:

- Getting better data
- Getting a Data strategy and governance framework
- Hiring smarter people
- Something..something DataOps, AI

These are all lovely ideas, but moving the lever on them is often impossible or impractical.

- Change your mindset (and how you run projects)

Data analytics is an exploratory and scientific endeavour that isn’t supposed to succeed every time. Just like not all lab experiments yield positive results. Instead of lamenting failures, develop a mindset of innovation and agile working where new ideas are prototyped and investment in R&D promoted but capped and balanced.

- Pick better projects

A question I get all the time is how to get started with data science projects in an established business. Often there is a disconnect between those doing the work and those deciding what to do. The most undervalued skill in delivering value with data science teams is picking projects that are likely to succeed. There is no shortcut - it takes years of hard-earned experience, and it requires a balance of hands-on technical skills and commercial awareness.

We have a dedicated program for businesses looking to get started or deepen their data analytics journey. We can help change your attitude and pick better projects.

While many (including me) have leveled a fair amount of criticism towards such solutions, I thought it would be worth seeing what the fuss was about.

Could I go head-to-head on the same predictive modelling challenge and compete with the might of Microsoft’s AutoML solution? Even worse, would I enjoy it? Even more worse, could I win??

**Objective**: Create the most accurate time series forecasting model

**Data Source**: Half-hourly electricity demand for Victoria, Australia^{1}

**Training data**: 51,120 records from 2012-01-01 to 2014-11-30

**Test data**: 1488 records from 2014-12-01 to 2014-12-31

**Method 1**: Use Microsoft Azure’s Automatic ML product.

**Method 2**: Hand code a statistical time series model in R

names | type | description |
---|---|---|
Time | datetime | Time stamp |
Demand | double | Target Variable: Electricity Demand |
Temperature | double | Temperature for the day |
Date | date | Date |
Holiday | logical | Was it a holiday date? |

The process to set up a new AutoML job was very easy and assumes you are working under somewhat sanitized conditions (which I was in this case).

Once I kicked it off, it chugged away for an hour and 33 minutes. To my horror, I realized it takes the ‘kitchen sink’ approach and fits a suite of 41 (!) different machine learning models to the training data. Hyperparameter tuning is done by constructing validation sets using K-fold cross-validation.

The best performing model is then selected and then predictions are run on the test set. It’s a little concerning that Test set evaluation is only in ‘Preview’ mode. It was also very confusing to dig out the results on the test set. Most of the metrics prominently displayed are overly confident in-sample accuracy results.

The winning model in my case was a ‘Voting Ensemble’ of three models:

- MaxAbsScaler, ExtremeRandomTrees
- StandardScalerWrapper, XGBoostRegressor

- StandardScalerWrapper, LightGBM

Overall the process was very easy and user friendly. It took a long time to train, but I didn’t have to think about anything - at all (which is usually time consuming) so overall it was a quick solution. I trained the model on a Standard_DS11_v2 (2 cores, 14 GB RAM, 28 GB disk) compute instance which costs $0.2 per hour. So it cost money, but not much.

Performance evaluation to follow below…

The process for doing this myself involved much more thought and brain-effort. Here are some notes.

The data set is quite complicated as it’s sub-daily and has (probably) three seasonal periods (daily, weekly, yearly). There was also maybe some trend and a few outliers to deal with. The data set also contained covariates such as Temperature and Holiday indicators.

Due to the seasonal complexity many traditional statistical methods were not appropriate like straight ARIMA (autoregressive integrated moving average) and ETS (exponential smoothing). While STL (Seasonal and Trend decomposition using Loess) can handle multiple seasonal periods I wanted a method to handle the covariates (like Temperature and Holidays). My next step was to think of Time Series Linear Regression models. However, accounting for yearly seasonality with 30min data meant fitting 17,520 (2 * 24 * 365) parameters just for this seasonal period. Which seemed excessive.

For longer, multiple-seasonal periods, using Fourier terms can be a good idea. Here a small number of terms in a Fourier series is estimated to approximate a more complex seasonal function. This type of *Dynamic Harmonic Regression*^{2} can also handle exogenous covariates, and we can even fit the model with ARIMA errors to account for the short-term dynamics of time series data.
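To show the idea of swapping thousands of seasonal dummies for a handful of sine and cosine pairs, here is a base R sketch on simulated half-hourly data. It is not the model I actually fitted: the `fourier_terms()` helper, the simulated series and the choice of K are all assumptions for illustration, and it uses plain `lm()` rather than ARIMA errors.

```r
# Build K sine/cosine pairs for a given seasonal period (hypothetical helper).
fourier_terms <- function(t, period, K) {
  terms <- lapply(seq_len(K), function(k) {
    cbind(sin(2 * pi * k * t / period), cos(2 * pi * k * t / period))
  })
  out <- do.call(cbind, terms)
  colnames(out) <- paste0(rep(c("S", "C"), times = K),
                          rep(seq_len(K), each = 2), "_", period)
  out
}

# Eight weeks of simulated half-hourly demand with daily and weekly cycles.
set.seed(123)
t <- seq_len(48 * 7 * 8)
demand <- 5000 + 800 * sin(2 * pi * t / 48) + 300 * cos(2 * pi * t / 336) +
  rnorm(length(t), sd = 100)

X <- cbind(fourier_terms(t, period = 48,  K = 3),   # daily seasonality
           fourier_terms(t, period = 336, K = 2))   # weekly seasonality

fit <- lm(demand ~ X)
length(coef(fit))  # intercept + 2 * (3 + 2) = 11 coefficients
```

A yearly period would be handled the same way with a period of roughly 48 * 365.25 and a few more pairs, which is still far fewer than 17,520 parameters.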

In fact, this very approach was outlined in the excellent *Forecasting: Principles and Practice*^{3} using this very same example data set. I decided to borrow (steal) the ideas of creating a piece-wise linear trend for temperature. I also went a bit crazy with encoding specific holiday dummy variables and some other tweaks.

Overall I found this method slow to fit, and not overly performant. I decided next to try fitting a Prophet^{4} model. Prophet is an open-source automated algorithm for time series forecasting developed by Facebook. It uses a Bayesian framework to fit complex, non-linear, piece-wise regression models. For complex time series data, it provides a decent, fast framework including exogenous variables, holiday and seasonal effects. I didn’t do any principled hyperparameter tuning, but I did fiddle around with the model a bit.

So who won?

The AutoML platform did :( , but only just. Below is the comparison of RMSE and MAPE. The AutoML is red, my predictions are in blue. I stuffed up over Christmas a bit, which admittedly is a tricky hold-out month for testing.

Method | Metric | Value |
---|---|---|
Azure AutoML | RMSE | 213 |
Azure AutoML | MAPE | 3.56 |
Me | RMSE | 274 |
Me | MAPE | 4.96 |
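For reference, this is how the two metrics in the table are usually defined. It is a base R sketch with made-up numbers, not the code that produced the results above.

```r
# Root mean squared error: in the units of the data, punishes large misses.
rmse <- function(actual, forecast) sqrt(mean((actual - forecast)^2))

# Mean absolute percentage error: scale-free, expressed in percent.
mape <- function(actual, forecast) 100 * mean(abs((actual - forecast) / actual))

# Hypothetical demand values and forecasts (illustrative only).
actual   <- c(4000, 4500, 5000, 5500)
forecast <- c(4100, 4400, 5200, 5300)

rmse_value <- rmse(actual, forecast)
mape_value <- mape(actual, forecast)
c(RMSE = rmse_value, MAPE = mape_value)
```

Both are computed on the held-out test month only; in-sample versions of the same numbers will always look more flattering.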

So overall it was pretty close, but in terms of pure predictive performance, the AutoML platform did pip me at the post. Admittedly, the solution I arrived at was probably more of an ML solution than a ‘classical’ time series method given it is still an automated algorithm. If I had more time and patience I probably could have pursued a more complex regression model. In fact in *Forecasting: Principles and Practice*, the authors also cite the performance of a straight Dynamic Harmonic Regression is limited, however they go on to propose other innovative approaches^{5}^{6}, including splitting the problem into separate models for each 30min period and using regression splines to better capture exogenous effects. So it can be done, but not without a huge amount of effort.

This all led me to think: If the data are quite complex for a time series problem, then of course a more Machine Learn-y solution would outperform. I wonder what would happen if we repeated the same exercise but with many fewer data points and some quirky time series characteristics.

My hypothesis is, the machine learning models will not have sufficient data to fit well. On the other hand, my experience and gestalt will enable me to select and encode a statistical model that is appropriate and gain an edge on a black-box type of solution.

**Objective**: Create the most accurate time series forecasting model

**Data Source**: Monthly Medicare Australia prescription data^{7}, Anatomical Therapeutic Chemical index classification A10

**Training data**: 163 records from Jul 1991 to Jan 2005 (black line)

**Test data**: 41 records from Feb 2005 to Jun 2008 (grey line)

names | type | description |
---|---|---|
Cost | double | Cost of the scripts in $AUD |
Month | double | Month time stamp |

Here we have fewer than 200 data points, but we can visually inspect the time series and see that there is a clear trend, the process is multiplicative and there is a single, yearly seasonal pattern.

The AutoML platform again used a Voting Ensemble, churned out in 43 minutes, but this time using:

- ProphetModel (it must have copied me from last round ;))

- Exponential Smoothing

Given the multiplicative process here, I modeled the log transformed data. (I did try a more generalized Box-Cox transformation, but got better performance with a straight natural log transform). I tried an ARIMA model, using model selection via the Hyndman-Khandakar algorithm^{8}, which resulted in an `ARIMA(2,0,1)(1,1,2)[12] w/ drift` model.

Yay! I won this round. Quite easily.

Method | Metric | Value |
---|---|---|

Azure AutoML | RMSE | 2.43 |

Azure AutoML | MAPE | 9.22 |

Me | RMSE | 1.63 |

Me | MAPE | 7.23 |

Well, I call it a draw.

Here are some of my closing thoughts from this experiment.

An ML solution might be a good choice if:

- You have lots of data

- You care a lot about prediction

- You don’t have to be too transparent

- Interpretation is not very important

- You have a very complex time series data set

I would caveat this with not just blindly modelling your problems away. You still need to understand the process to ensure your predictions are well calibrated and you don’t fall prey to over fitting.

A more classical statistical modelling approach might be a good choice if:

- You want a more flexible framework

- You need to / want to encode domain knowledge

- You want a more interpretable model
- You have fewer data

The good news is, if you are sufficiently smart and motivated (which I am sure you are) you can certainly compete in terms of model performance with an ML solution, even on complex problems. The bad news is, it’s harder and you need to think a bit. You can’t just delegate all your thinking to the machines. Not yet anyway.

O’Hara-Wild M, Hyndman R, Wang E, Godahewa R (2022). *tsibbledata: Diverse Datasets for ‘tsibble’*. https://tsibbledata.tidyverts.org/, https://github.com/tidyverts/tsibbledata/.

Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed on 2023-06-05.

Thanks to the Tidyverts team https://tidyverts.org/. The new and improved time series stack in R makes all this so easy.

**Note**: None of this was super-rigorous, and I certainly tilted the board in my favour here and there. It was just fun and a chance to play around with a tool that I have previously avoided for no real reason.

1. Source: Australian Energy Market Operator; *tsibbledata* R package

2. Young, P. C., Pedregal, D. J., & Tych, W. (1999). Dynamic harmonic regression. Journal of Forecasting, 18, 369–394. https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-131X(199911)18:6%3C369::AID-FOR748%3E3.0.CO;2-K

3. Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed on 2023-06-05.

4. Taylor SJ, Letham B. 2017. Forecasting at scale. PeerJ Preprints 5:e3190v2 https://doi.org/10.7287/peerj.preprints.3190v2

5. Fan, S., & Hyndman, R. J. (2012). Short-term load forecasting based on a semi-parametric additive model. IEEE Transactions on Power Systems, 27(1), 134–141. https://ieeexplore.ieee.org/document/5985500

6. Hyndman, R. J., & Fan, S. (2010). Density forecasting for long-term peak electricity demand. IEEE Transactions on Power Systems, 25(2), 1142–1153. https://ieeexplore.ieee.org/document/5345698

7. Source: Medicare Australia; *tsibbledata* R package

8. Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27(1), 1–22. https://doi.org/10.18637/jss.v027.i03

**What is the forecast demand for the next week?**

Who knows? I certainly don’t know, despite doing forecasting for a job.

In fact, anyone who tells you they do know (with certainty) is wrong.

Often you have to come up with something. Finance needs an estimate of sales or budgets, or your manager needs something to take to the board. So you have to churn out some numbers. But what numbers?

Here are two extremes:

- You could ignore all historical data and just use the most recent data point as the most reliable forecast.

- You could ignore any specific value and just take the average of everything.

Both are in the ballpark but they aren’t very squiggly like the real data. They clearly aren’t capturing the *seasonality*.

You could be lazy (clever) and just use the same time period from *last* week as your current forecast. It actually looks really good. But it’s just one, fixed realization of what could happen. Will it play out exactly like your forecast? *Exactly* like last week? Almost certainly not.

Let’s simulate another way history could play out. (By statistically resampling the residuals in the training data)

And another

And 20 more:

So instead of blindly relying on one (kind of okay looking, but totally unrealistic) forecast, we now have a ‘fuzzy’ region of plausible values.
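The resampling idea can be sketched in a few lines of base R. This is a simplified illustration on simulated half-hourly data with a daily seasonal naive forecast, not the exact code behind the plots; the series, the seasonal period and the number of paths are assumptions.

```r
set.seed(2023)
period <- 48                        # half-hourly data: 48 observations per day
t <- seq_len(period * 28)           # four weeks of simulated history
y <- 5000 + 800 * sin(2 * pi * t / period) + rnorm(length(t), sd = 150)

# Seasonal naive: forecast each point with the value one season earlier.
fitted_vals <- y[seq_len(length(y) - period)]
resid <- y[-seq_len(period)] - fitted_vals   # historical forecast errors

h <- period                         # forecast one day ahead
point_forecast <- tail(y, period)   # the single 'fixed' forecast

# Each simulated future = point forecast + residuals resampled from history.
n_paths <- 20
paths <- replicate(n_paths, point_forecast + sample(resid, h, replace = TRUE))

# Empirical 95% interval across the simulated paths at each forecast step.
interval <- apply(paths, 1, quantile, probs = c(0.025, 0.975))
```

Each column of `paths` is one plausible future, and `interval` summarises the ‘fuzzy’ region they trace out. With only 20 paths the interval is rough; in practice you would simulate many more.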

If you don’t know with certainty what your forecast is going to be, then don’t just give one concrete number. It’s misleading.

It’s more important to know how uncertain your forecast is than what your forecast is.

You can still pull out a mean or point estimate but by delivering the whole story you are conveying not just what you think is likely to happen, but how certain you are about it.

Often this can lead to more meaningful discussions. Perhaps it’s not the mean of your forecast distribution that you care about, it’s the extreme values. For example, if I was stress testing business cash flows and forecasting the cost of a maintenance activity, I’d be more interested in forecasting the 95% upper limit of forecast costs rather than the ‘expected’ cost.

In practice, you often don’t have to do any mathematical simulations. Proper time series forecasts have methods to calculate prediction intervals out of the box (see below).

Want to know what actually happened? Here it is in red.

A huge criticism of providing probabilistic forecasts is that it seems like you are ‘hedging your bets’ and being evasive. The reality is, in our electricity example, it was just very hot and demand surged (much higher than even our 95% prediction interval). So if it’s realistic and plausible to see forecast values in these ranges (or even more extreme) - why wouldn’t you want to include that information in your forecast??

In practice a statistician will likely use a more sophisticated model than presented here. These models may take into account temperature and other factors but there will still be unexplained variance that will need to be quantified if you want a quality forecast produced.

**If you or your organisation want to get serious about making proper forecasts and being proactive when making critical decisions - drop me a line, I can help.**

O’Hara-Wild M, Hyndman R, Wang E, Godahewa R (2022). *tsibbledata: Diverse Datasets for ‘tsibble’*. https://tsibbledata.tidyverts.org/, https://github.com/tidyverts/tsibbledata/.

Source: Australian Energy Market Operator; *tsibbledata* R package.

Randomness, like time or space, is one of those deep concepts that is super hard to reason about. Despite this, it’s fairly common to see random number generators in practice. A casino will use one in their gaming software to randomise outcomes; a lottery or competition website will use one to pick winners; scientists use them to run simulations; and cryptographic applications are powered by some form of randomness.

Flash back to the 1980s, when down-on-his-luck unemployed ice cream man Michael Larson cracked the (non-random) pattern in TV game show *Press Your Luck* and took them for over $100k. Sadly it didn’t end well for Larson, with Ponzi schemes, radio station challenges gone awry and a break and enter. But it goes to show what can happen if you don’t take randomness seriously.

Randomness is an actual or apparent lack of pattern, but it’s kind of hard to test and even its very nature is somewhat unclear. In 1903, a British mathematician called Frank Ramsey was born. Ramsey was a militant atheist, but interestingly his brother went on to become Archbishop of Canterbury. Frank went on to study mathematics and economics, becoming a student of famous economist John Maynard Keynes. Somehow he also ended up translating a German book of logical philosophy into English and joined a secret intellectual society just after the war. A minor discovery of his ended up blossoming into what is known as Ramsey theory, a branch of mathematical combinatorics showing that in any sufficiently large system, however disordered, there is always *some* order. This has had interesting (and conspiratorial) implications for whether there is even such a thing as ‘random’. Oh and by the way, all this happened before he died at age 26 after complications from liver disease likely caused by swimming in a river.

Generally, RNGs can generate **True** random numbers or **Pseudo** random numbers. True RNGs generate random bits from natural stochastic sources like background radiation, quantum effects, atmospheric noise etc. Next time you are tempted to toss a coin, perhaps head over to random.org instead for some ‘true’ randomness.

There is a fun history lesson in how random.org got started, with true random numbers generated using static from a cheap $10 radio, laden with a post-it note advising passers-by not to fiddle with the knobs.

Pseudo-random numbers are generated using a ‘seed’ that deterministically produces numbers that look random, but can be entirely reproduced from the initial seed condition. This is often useful (and used by me all the time) when you need a random sample, but you need it to be replicated exactly for scientific reproducibility purposes.
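In R this looks like the following minimal sketch; the seed value and the sample drawn are arbitrary choices for the example.

```r
# Same seed, same 'random' sample.
set.seed(123)
first <- sample(1:100, 5)

set.seed(123)
second <- sample(1:100, 5)

identical(first, second)  # TRUE

# Without resetting the seed, the generator simply continues its sequence.
third <- sample(1:100, 5)
```

This is great for reproducible analysis and exactly what you do not want for anything where predictability is a weakness, such as gaming or cryptography.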

Given randomness itself is hard to test, there are a number of statistical test suites that perform a battery of diagnostics on a large sample of random numbers in order to test various aspects of randomness. One prominent test suite for cryptographic random bits is developed by NIST which uses 15 different statistical tests.

- The Frequency (Monobit) Test
- Frequency Test within a Block
- The Runs Test
- Tests for the Longest-Run-of-Ones in a Block
- The Binary Matrix Rank Test
- The Discrete Fourier Transform (Spectral) Test
- The Non-overlapping Template Matching Test
- The Overlapping Template Matching Test
- Maurer’s “Universal Statistical” Test
- The Linear Complexity Test
- The Serial Test
- The Approximate Entropy Test
- The Cumulative Sums (Cusums) Test
- The Random Excursions Test
- The Random Excursions Variant Test
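To give a flavour of what these tests do, here is a base R sketch of the simplest one, the Frequency (Monobit) test, which checks whether the proportion of ones and zeros is close to half. The function name is my own, the input is a toy 10-bit sequence, and real testing should use the official suite on far longer sequences.

```r
# Frequency (Monobit) test: are there about as many ones as zeros?
monobit_test <- function(bits) {
  n     <- length(bits)
  s_n   <- sum(2 * bits - 1)        # map 0/1 to -1/+1 and add them up
  s_obs <- abs(s_n) / sqrt(n)       # test statistic
  # p-value = erfc(s_obs / sqrt(2)), written here using the normal CDF
  2 * pnorm(s_obs, lower.tail = FALSE)
}

bits <- c(1, 0, 1, 1, 0, 1, 0, 1, 0, 1)   # six ones, four zeros
p_value <- monobit_test(bits)
p_value                                   # approximately 0.527
```

A very small p-value (the suite uses 0.01 as its cutoff) suggests the sequence is not random. Passing this one test says very little on its own, which is why there are fourteen more.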

So, like much of the mathematics behind everyday scenarios, there is a fascinating history and there are deep technical and philosophical implications. Given what is on the line for organisations relying on randomness, it’s useful to engage a specialist to help run and interpret these test suites.

And remember, if you get it wrong, some unemployed ice-cream man is just waiting to swoop in and take advantage.

Here I’ll cover some options to deploy this environment to the cloud so you can access it anywhere.

A common pattern is to create a Virtual Machine (VM) with a cloud service provider (such as AWS, Azure, GCP) and run your code there. I’ll cover an example using Microsoft Azure.

1. Deploy a VM with an Ubuntu operating system. Go ahead and choose the compute power you need.

2. Configure a custom network rule to allow traffic on port 8787 for RStudio.

3. Log into your new VM terminal using SSH.

4. Install Docker Engine by following these steps.

5. Clone and deploy the Docker container from Step 2 in my guide.

The above is fine, but arguably if you are setting up a VM from scratch for development purposes I’m not sure what benefit there is from using a docker container. You may as well just directly install what you want and consider the VM a ‘container’.

However, if you plan to make this available to other users in your organisation, or to adapt this guide for Shiny App development you may be interested in other features such as TLS/SSL security, scale up, advanced networking, continuous integration, continuous deployment, staging/production deployment slots etc. This represents a shift from development sandpit to ‘web app’. For this case, Azure App Service may be a lower hassle option. This is Microsoft’s enterprise grade, web app deployment managed service.

In the **Virtual Machine** model you are setting up compute infrastructure, deploying and running containers directly - then fiddling with the infrastructure layer for everything else. In **App Service** you push your custom Docker container image (here containing RStudio Server) to Azure Container Registry (kind of like DockerHub). Azure App Service then pulls and serves your app from there - without you having to stand up and manage an infra layer directly.

1. Create an Azure Container Registry (ACR) (or some other Docker repository) using this help guide

2. Run and test your container locally

3. Deploy your local container to ACR using this help guide

4. Create a new web app in Azure App Service using this help guide

Configuration:

- I didn’t have to fiddle with ports; presumably it reads the exposed ports in the Dockerfile and does this magically.

- For custom environment variables like the RStudio Server password, I had to manually add this in the config section.

and it worked just fine:

However it is often the case that users are operating in a restricted computing environment, such as in a corporate or government setting. Alternatively you may wish to create a custom development environment to test or replicate some other specific setup. This is a good case to move away from locally managed software to containerization, such as Docker.

I have set up a GitHub repository that sets up a local data science development environment in the browser.

It builds a docker container including:

- Ubuntu 20.04 LTS
- R version 4.2
- RStudio Server 2022.02.3+492
- All tidyverse packages and devtools
- TeX & publishing-related packages

The image builds on the rocker/verse image from Rocker Project.

Some other enhanced configuration options are included in the Dockerfile, such as preloading your RStudio preferences to get the same look and feel you have locally, the option to install other CRAN packages, and mounting local volumes to persist your work locally.

Go here for Step by step instructions:

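The typical choice when calculating binomial proportion confidence intervals is the asymptotic, or normally approximated, ‘Wald’ interval, where the success probability is estimated by:

$$\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Here $\hat{p}$ is the sample proportion, $n$ is the sample size and $z_{1-\alpha/2}$ is the standard normal quantile for the chosen confidence level.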

In many settings, such as marketing analytics or manufacturing processes, the sample proportion is close to 0 or 1. Evaluating asymptotic confidence intervals near these boundary conditions will lead to underestimation of the error, and in some cases will produce an interval outside $[0, 1]$.

Fortunately other methods exist, such as Wilson’s score interval, exact methods and Bayesian approaches. The recommendation here is to examine the probability coverage and explore alternative methods for sample size and CI calculation, especially when the parameter is near the boundary conditions, or in cases of very small n.
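To see the boundary problem without any packages, here is a hand-rolled base R sketch comparing the Wald interval with Wilson’s score interval. The function names and the example of 1 success in 50 trials are my own choices for illustration.

```r
# Wald (normal approximation) interval.
wald_ci <- function(x, n, conf = 0.95) {
  p_hat <- x / n
  z     <- qnorm(1 - (1 - conf) / 2)
  half  <- z * sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - half, upper = p_hat + half)
}

# Wilson score interval.
wilson_ci <- function(x, n, conf = 0.95) {
  p_hat  <- x / n
  z      <- qnorm(1 - (1 - conf) / 2)
  denom  <- 1 + z^2 / n
  centre <- (p_hat + z^2 / (2 * n)) / denom
  half   <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / denom
  c(lower = centre - half, upper = centre + half)
}

wald   <- wald_ci(1, 50)    # lower bound falls below zero
wilson <- wilson_ci(1, 50)  # stays inside [0, 1]
rbind(wald, wilson)
```

A negative lower bound for a proportion is clearly nonsense, and it is a handy warning sign that the normal approximation has broken down.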

```
library(binom)
library(tidyverse)

n <- 50                   # sample size
p <- c(0.01, 0.5, 0.99)   # proportions to examine

# Confidence intervals from every method in binom, for n * p successes out of n
x <- purrr::map_df(p, .f = ~binom.confint(x = n * .x, n = n, methods = 'all'))

# One panel per proportion, one interval per method
ggplot(x, aes(colour = factor(x))) +
  geom_point(aes(mean, method), show.legend = FALSE) +
  geom_errorbarh(aes(xmin = lower, xmax = upper, y = method), show.legend = FALSE) +
  geom_vline(xintercept = c(0, 1), lty = 2, col = "grey") +  # valid range for a proportion
  facet_wrap(~(x*2/100)) +                                   # successes / n, since n = 50
  theme_bw() +
  labs(title = "A variety of binomial confidence interval methods for p = 0.01, 0.5 & 0.99",
       subtitle = "Note unusual behaviour near 0.01 and 0.99")
```

```
# Probability coverage of each method at each value of p
cov <- purrr::map_df(p, ~binom.coverage(.x, n, conf.level = 0.95, method = "all"))

ggplot(cov, aes(colour = factor(p))) +
  geom_point(aes(coverage, method), show.legend = F) +
  geom_vline(xintercept = 0.95, lty = 2) +
  facet_wrap(~(p)) +
  theme_bw() +
  labs(title = "Probability coverage for a variety of binomial confidence interval methods",
       subtitle = "Reference line at 0.95 coverage")
```
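For intuition about what a coverage figure means, the sketch below computes the exact coverage of the Wald interval in base R: for a given true $p$ and $n$, it sums the binomial probabilities of every outcome whose interval contains $p$. The helper name `wald_coverage` is hypothetical, and this is an illustration of the idea rather than the implementation used by `binom.coverage()`.

```r
# Exact coverage of the Wald interval for a true proportion p and sample size n:
# add up the binomial probability of every outcome x whose interval contains p.
# `wald_coverage` is a hypothetical helper name used for illustration.
wald_coverage <- function(p, n, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  x <- 0:n
  p_hat <- x / n
  se <- sqrt(p_hat * (1 - p_hat) / n)
  covers <- (p_hat - z * se <= p) & (p <= p_hat + z * se)
  sum(dbinom(x[covers], size = n, prob = p))
}

cov_wald <- sapply(c(0.01, 0.5, 0.99), wald_coverage, n = 50)
cov_wald
```

At $p = 0.01$ and $p = 0.99$ the coverage falls far short of the nominal 0.95, largely because an observed count of 0 (or $n$) produces a zero-width interval that cannot contain the true value.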

A good discussion is contained in:

Wallis, Sean A. (2013). “Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods”. **Journal of Quantitative Linguistics.** 20 (3): 178–208. doi:10.1080/09296174.2013.799918. S2CID 16741749.

https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval

There are widgets produced and we need to audit a sample of them. Some sort of rejection threshold is needed on that sample to decide whether the whole batch of widgets has met a specified quality level.

Typically, a binomial distribution would be appropriate for measuring the probability of $k$ successes (in this case defects found) in $n$ independent trials with probability $p$.

The word *independent* is doing a lot of work here as it implies that we are sampling *with* replacement in order to maintain a fixed probability parameter $p$.

In cases where you are taking draws from a population *without* replacement (such as when you do destructive inspections on a widget) the underlying population changes with each draw and so does the probability $p$.

In this case, modelling the process using a hypergeometric distribution may be a more appropriate choice.

It similarly describes the probability of $k$ successes in $n$ draws without replacement. However, instead of specifying a parameter $p$, we provide the population size $N$, which contains $K$ success states.

Let’s say we have 2000 widgets manufactured and we want to sample 50 (ignore why 50, that is a whole separate question). We have an assumed quality level of 10% defective units (which we define as ‘success’ for complicated reasons).

Q: Based on a sample of 50 widgets how many defective units would be considered unlikely (95% CI) to occur randomly given our assumed quality level, and therefore result in us rejecting the entire batch?

We can compare the binomial probability mass function with the hypergeometric and observe they are essentially the same.

`library(tidyverse)`

```
tibble(
  x = seq.int(0, 50, by = 1),
  binomial = dbinom(x, size = 50, prob = 0.1),
  hypergeom_2000 = dhyper(x, m = 200, n = 1800, k = 50)
) |>
  pivot_longer(cols = -1, names_to = 'distribution', values_to = 'density') |>
  ggplot(aes(x, density, col = distribution)) +
  geom_line() +
  geom_point() +
  xlim(c(0, 20)) +
  theme_bw() +
  labs(x = "Observed defective units in sample")
```
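The argument names of `dhyper()` can be confusing next to the usual notation: `m` is the number of success states in the population, `n` is the number of non-success states, and `k` is the number of draws. As a quick check of that mapping, this sketch rebuilds the probability mass function from binomial coefficients for the 2000-widget example. The variable names (`N`, `K`, `n_draws`) are my own labels, not names from the post's code.

```r
# Hypergeometric PMF from first principles, compared with dhyper().
# N = population size, K = defectives in the population, n_draws = sample size.
N <- 2000
K <- 200
n_draws <- 50
x <- 0:n_draws

# P(X = x) = choose(K, x) * choose(N - K, n_draws - x) / choose(N, n_draws)
pmf_manual <- choose(K, x) * choose(N - K, n_draws - x) / choose(N, n_draws)

# dhyper() takes successes (m), non-successes (n) and draws (k)
pmf_dhyper <- dhyper(x, m = K, n = N - K, k = n_draws)

all.equal(pmf_manual, pmf_dhyper)
```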

However, if we had a smaller population of say 100 or 70 widgets, how would this compare?

```
tibble(
  x = seq.int(0, 50, by = 1),
  binomial = dbinom(x, size = 50, prob = 0.1),
  hypergeom_2000 = dhyper(x, m = 200, n = 1800, k = 50),
  hypergeom_100 = dhyper(x, m = 10, n = 90, k = 50),
  hypergeom_070 = dhyper(x, m = 7, n = 63, k = 50)
) |>
  pivot_longer(cols = -1, names_to = 'distribution', values_to = 'density') |>
  ggplot(aes(x, density, col = distribution)) +
  geom_line() +
  geom_point() +
  xlim(c(0, 20)) +
  theme_bw() +
  labs(x = "Observed defective units in sample")
```

We can see these curves are markedly different. And indeed the 95% confidence intervals obtained are narrower for the hypergeometric case.

`qbinom(p = c(0.025, 0.975), size = 50, prob = 0.1)`

`[1] 1 9`

`qhyper(p = c(0.025, 0.975), m = 10, n = 90, k = 50)`

`[1] 2 8`
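One way to see the practical difference is to ask how surprising a given count is under each model. The sketch below computes the upper-tail probability of finding more than 8 defective units in the sample of 50; the cut-off of 8 is simply the upper hypergeometric quantile printed above, used here as an illustrative threshold rather than a recommended acceptance number.

```r
# Probability of observing more than 8 defectives in a sample of 50
# under each model (population of 100 with 10 defectives for the hypergeometric).
p_binom <- pbinom(8, size = 50, prob = 0.1, lower.tail = FALSE)
p_hyper <- phyper(8, m = 10, n = 90, k = 50, lower.tail = FALSE)

c(binomial = p_binom, hypergeometric = p_hyper)
```

Under the binomial model this tail probability is above 0.025, so 9 defectives would still fall inside the central 95% region; under the hypergeometric model it is below 0.025, so the same count would be flagged as unlikely.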

We can see from a random draw of 1 million samples from each PMF that they both have the same expected values, but the variance is smaller in the hypergeometric case.

```
X <- rbinom(n = 1e6, size = 50, prob = 0.1)
Y <- rhyper(nn = 1e6, m = 10, n = 90, k = 50)
mean(X)
```

`[1] 4.999079`

`var(X)`

`[1] 4.498633`

`mean(Y)`

`[1] 5.000942`

`var(Y)`

`[1] 2.273265`
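The simulated moments can be checked against the closed-form results. Both distributions have mean $np$, but the hypergeometric variance is the binomial variance multiplied by the finite population correction $(N - n)/(N - 1)$. A short sketch, using my own variable names:

```r
# Closed-form mean and variance for the example above.
N <- 100      # population size
K <- 10       # defectives in the population
n <- 50       # sample size
p <- K / N

mean_both <- n * p                    # same expected value for both models
var_binom <- n * p * (1 - p)          # binomial variance
fpc       <- (N - n) / (N - 1)        # finite population correction
var_hyper <- var_binom * fpc          # hypergeometric variance

c(mean = mean_both, var_binom = var_binom, var_hyper = var_hyper)
```

Sampling half of the population roughly halves the variance, which matches the simulated values above.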

As a consequence of removing samples in each draw we influence the probability of a subsequent success. If our $N$ and $K$ are very large relative to our sample this won’t make much of an impact, but it can be impactful for smaller populations, or relatively larger samples.

From our example above, failing to use a hypergeometric distribution to model this process for smaller populations will result in wider, more conservative acceptance regions which can increase consumer risk in a manufacturing process.

Typical guidance on when to use each distribution is given in manufacturing standards such as *AS 1199.1-2003: Sampling Procedures for Inspection by Attributes* and typically involves how you structure your sampling scheme.

In the help page for `lme4::predict.merMod()` is the following note:

- There is no option for computing standard errors of predictions because it is difficult to define an efficient method that incorporates uncertainty in the variance parameters; we recommend bootMer for this task.

There are some useful resources out there but they took a while to track down, so this post may serve as a good reference in the future.

Let’s go through an example using the famous `sleepstudy` data showing the average reaction time per day (in milliseconds) for subjects in a sleep deprivation study.

```
library(lme4)
library(tidyverse)
data("sleepstudy")
```

We would like to model the relationship between `Reaction` and `Days`.

```
ggplot(sleepstudy, aes(Days, Reaction)) +
geom_point(show.legend = FALSE) +
theme_bw()
```

Fitting a basic linear model:

```
fit_lm <- lm(Reaction ~ Days, data = sleepstudy)
summary(fit_lm)
```

```
Call:
lm(formula = Reaction ~ Days, data = sleepstudy)
Residuals:
Min 1Q Median 3Q Max
-110.848 -27.483 1.546 26.142 139.953
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 251.405 6.610 38.033 < 2e-16 ***
Days 10.467 1.238 8.454 9.89e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 47.71 on 178 degrees of freedom
Multiple R-squared: 0.2865, Adjusted R-squared: 0.2825
F-statistic: 71.46 on 1 and 178 DF, p-value: 9.894e-15
```

```
ggplot(sleepstudy, aes(Days, Reaction)) +
geom_point(show.legend = FALSE) +
geom_abline(slope = fit_lm$coefficients[2], intercept = fit_lm$coefficients[1]) +
theme_bw()
```

But this ignores the fact these data are not independent. We have multiple observations per subject. Some look like a good fit, others not.

```
ggplot(sleepstudy, aes(Days, Reaction, col = Subject)) +
geom_point(show.legend = FALSE) +
geom_abline(slope = fit_lm$coefficients[2], intercept = fit_lm$coefficients[1]) +
facet_wrap(~Subject) +
theme_bw()
```

Let’s add a random intercept term for `Subject`. For simplicity we will leave out any other random effects.

```
fit <- lme4::lmer(Reaction ~ Days + (1|Subject), data = sleepstudy)
summary(fit)
```

```
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (1 | Subject)
Data: sleepstudy
REML criterion at convergence: 1786.5
Scaled residuals:
Min 1Q Median 3Q Max
-3.2257 -0.5529 0.0109 0.5188 4.2506
Random effects:
Groups Name Variance Std.Dev.
Subject (Intercept) 1378.2 37.12
Residual 960.5 30.99
Number of obs: 180, groups: Subject, 18
Fixed effects:
Estimate Std. Error t value
(Intercept) 251.4051 9.7467 25.79
Days 10.4673 0.8042 13.02
Correlation of Fixed Effects:
(Intr)
Days -0.371
```

New fitted lines can be drawn, showing the adjusted intercept for each subject (original regression line kept for reference).

```
sleepstudy |>
mutate(pred = predict(fit, re.form = NULL)) |>
ggplot(aes(Days, Reaction, col = Subject)) +
geom_point(show.legend = FALSE) +
geom_abline(slope = fit_lm$coefficients[2], intercept = fit_lm$coefficients[1], col = "grey") +
geom_line(aes(Days, pred), show.legend = FALSE) +
facet_wrap(~Subject) +
theme_bw()
```

Let’s try and generate prediction intervals using `lme4::bootMer()` as suggested.

First on the in-sample data.

```
# predict function for bootstrapping
predfn <- function(.) {
  predict(., newdata = new, re.form = NULL)
}

# summarise output of bootstrapping
sumBoot <- function(merBoot) {
  return(
    data.frame(
      fit = apply(merBoot$t, 2, function(x) as.numeric(quantile(x, probs = .5, na.rm = TRUE))),
      lwr = apply(merBoot$t, 2, function(x) as.numeric(quantile(x, probs = .025, na.rm = TRUE))),
      upr = apply(merBoot$t, 2, function(x) as.numeric(quantile(x, probs = .975, na.rm = TRUE)))
    )
  )
}

# 'new' data
new <- sleepstudy
```

Notes:

- In the `predict()` function we specify `re.form=NULL`, which identifies which random effects to condition on. Here `NULL` includes all random effects. Obviously here you can compute individual predictions assuming you feed it with the correct grouping level in your data.
- In the `lme4::bootMer()` function we set `use.u=TRUE`. This conditions on the random effects and only provides uncertainty estimates for the i.i.d. errors resulting from the fixed effects of the model.

If use.u is TRUE and type==“parametric”, only the i.i.d. errors are resampled, with the values of u staying fixed at their estimated values.

`boot <- lme4::bootMer(fit, predfn, nsim=250, use.u=TRUE, type="parametric")`
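The object returned by `bootMer()` stores its replicates in `$t`, a matrix with one row per simulation and one column per element returned by the function being bootstrapped. The summary step is therefore just column-wise percentiles. The sketch below shows this on a stand-in list filled with random numbers (an assumption for illustration only, not real model output), so it runs without fitting anything.

```r
# Stand-in for a bootMer result: 250 simulations (rows) x 3 observations (columns).
# The values are random numbers for illustration only, not model output.
set.seed(1)
fake_boot <- list(
  t = matrix(
    rnorm(250 * 3, mean = rep(c(250, 300, 350), each = 250), sd = 30),
    nrow = 250
  )
)

# Column-wise percentiles: median as the point estimate, 2.5% and 97.5% as limits
summ <- t(apply(fake_boot$t, 2, quantile, probs = c(0.5, 0.025, 0.975), na.rm = TRUE))
colnames(summ) <- c("fit", "lwr", "upr")
summ
```

Each row of `summ` corresponds to one observation, which is why the summary can be bound column-wise onto the data used for prediction.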

```
new |>
bind_cols(sumBoot(boot)) |>
ggplot(aes(Days, Reaction, col = Subject, fill = Subject)) +
geom_point(show.legend = FALSE) +
geom_abline(slope = fit_lm$coefficients[2], intercept = fit_lm$coefficients[1]) +
geom_line(aes(Days, fit), show.legend = FALSE) +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3, show.legend = FALSE) +
facet_wrap(~Subject) +
theme_bw()
```

However, this gets complicated if we want to make predictions for *new* subjects.

We can no longer condition on the random effects, as the new subject level will not have a fitted random intercept value. Instead we need to effectively make a population level prediction (i.e. set the random effect to zero.). This makes sense as we don’t know what the random effect ought to be for a given, unobserved subject.

But we don’t want the prediction interval to just cover the uncertainty in the population level estimate. If we are interested in individual predictions, how can we incorporate the uncertainty of the random effects in the prediction intervals?

Let’s generate a new, unobserved subject.

```
new_subject <- tibble(
Days = 0:9,
Subject = factor("999")
)
```

We provide a new predict function that doesn’t condition on the random effects by using `re.form = ~0`. This lets us input and obtain predictions for new subjects.

```
predfn <- function(.) {
predict(., newdata=new_subject, re.form=~0, allow.new.levels=TRUE)
}
```
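With the random effects set to zero, the prediction for a new subject reduces to the fixed-effects line. As a sanity check, this sketch reproduces that line by hand from the rounded coefficients printed in the model summary above (so it will differ from `predict()` output only by rounding).

```r
# Population-level prediction by hand: intercept + slope * Days.
# Coefficients are the rounded fixed effects from the lmer summary above.
beta_intercept <- 251.4051
beta_days      <- 10.4673
Days <- 0:9

pop_pred <- beta_intercept + beta_days * Days
data.frame(Days = Days, predicted = pop_pred)
```

There is no randomness anywhere in this calculation, which is why bootstrapping it alone cannot capture subject-to-subject variation.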

```
new_subject |>
bind_cols(predicted = predfn(fit)) |>
ggplot(aes(Days, predicted, col = Subject)) +
geom_point() +
geom_abline(slope = fit_lm$coefficients[2], intercept = fit_lm$coefficients[1]) +
theme_bw() +
ylim(c(150, 450))
```

However, using `predict` just results in a completely deterministic prediction as shown above.

An alternative approach is to use `lme4::simulate()`, which will simulate responses for subjects non-deterministically using the fitted model object.

Below we can see a comparison of both approaches.

```
predfn <- function(.) {
predict(., newdata=new_subject, re.form=~0, allow.new.levels=TRUE)
}
sfun <- function(.) {
simulate(., newdata=new_subject, re.form=NULL, allow.new.levels=TRUE)[[1]]
}
```

```
new_subject |>
bind_cols(simulated = sfun(fit)) |>
bind_cols(predicted = predfn(fit)) |>
pivot_longer(cols = c(3, 4), names_to = "type", values_to = "val") |>
ggplot(aes(Days, val, col = type)) +
geom_point() +
geom_abline(slope = fit_lm$coefficients[2], intercept = fit_lm$coefficients[1]) +
theme_bw() +
ylim(c(150, 450))
```

We can use this `simulate()` function in our bootstrapping to resample responses from the fitted model (rather than resampling deterministic population predictions).

This time we set `use.u=FALSE` to provide uncertainty estimates from both the model errors and the random effects.

If use.u is FALSE and type is “parametric”, each simulation generates new values of both the “spherical” random effects $u$ and the i.i.d. errors $\epsilon$, using rnorm() with parameters corresponding to the fitted model x.
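To build intuition for what resampling both components does, here is a simplified base R sketch that simulates responses for a new subject directly from the rounded estimates in the model summary above: a fresh subject-level intercept shift plus fresh residual errors. It is a rough approximation only. Unlike the parametric bootstrap, it treats the fixed effects and variance components as known and does not refit the model.

```r
# Simplified simulation for a new subject from the fitted estimates.
# Assumes the fixed effects and standard deviations are known exactly,
# so it ignores the estimation uncertainty that the bootstrap captures.
set.seed(100)
beta   <- c(251.4051, 10.4673)  # fixed effects (intercept, Days)
sd_u   <- 37.12                 # Subject (Intercept) Std.Dev.
sd_eps <- 30.99                 # Residual Std.Dev.
Days   <- 0:9
nsim   <- 10000

# Each row is one simulated new subject: one intercept shift, ten residual errors
u   <- rnorm(nsim, mean = 0, sd = sd_u)
eps <- matrix(rnorm(nsim * length(Days), mean = 0, sd = sd_eps), nrow = nsim)
mu  <- matrix(beta[1] + beta[2] * Days, nrow = nsim, ncol = length(Days), byrow = TRUE)
y   <- mu + u + eps

# Percentile interval for each day, and the implied total standard deviation
pi_by_day <- t(apply(y, 2, quantile, probs = c(0.025, 0.975)))
total_sd  <- sqrt(sd_u^2 + sd_eps^2)
pi_by_day
```

The interval width at each day is close to $2 \times 1.96 \times \sqrt{\sigma_u^2 + \sigma_\epsilon^2}$, which shows how much of the spread for an unseen subject comes from the random intercept rather than the residual error.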

`boot <- lme4::bootMer(fit, sfun, nsim=250, use.u=FALSE, type="parametric", seed = 100)`

```
new_subject |>
bind_cols(sumBoot(boot)) |>
bind_cols(predicted = predfn(fit)) |>
ggplot(aes(Days, predicted, col = Subject, fill = Subject)) +
geom_point() +
geom_abline(slope = fit_lm$coefficients[2], intercept = fit_lm$coefficients[1]) +
geom_line(aes(Days, fit), show.legend = FALSE) +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3, show.legend = FALSE) +
theme_bw() +
ylim(c(150, 450))
```

So while we don’t have a conditional mode of the random effect (because it’s a new subject) we can derive a bootstrapped estimate of the prediction interval by resampling the random effects and model errors on simulated data values.

For comparison, here is what the same prediction interval would look like if we just used an unconditional population prediction. While the overall gist is the same, despite also resampling both the random effects and the i.i.d. errors, the interval is narrower as it is resampling just the deterministic population predictions of the model.

`boot <- lme4::bootMer(fit, predfn, nsim=250, use.u=FALSE, type="parametric", seed = 100)`

```
new_subject |>
bind_cols(sumBoot(boot)) |>
bind_cols(predicted = predfn(fit)) |>
ggplot(aes(Days, predicted, col = Subject, fill = Subject)) +
geom_point() +
geom_abline(slope = fit_lm$coefficients[2], intercept = fit_lm$coefficients[1]) +
geom_line(aes(Days, fit), show.legend = FALSE) +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3, show.legend = FALSE) +
theme_bw() +
ylim(c(150, 450))
```

Most of the material and code is taken from the sources below, in particular the lme4 GitHub issue. Also, the `merTools` package has a nice vignette comparing these methods with their own solution.

https://tmalsburg.github.io/predict-vs-simulate.html

https://github.com/lme4/lme4/issues/388

https://cran.r-project.org/web/packages/merTools/vignettes/Using_predictInterval.html

http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#predictions-andor-confidence-or-prediction-intervals-on-predictions

Our directory contains a folder called `myapp` which contains our shiny app file and other supporting files.

At the top level we have our Dockerfile and other config files. These should be modified accordingly.

This directory structure can be cloned from my github repo

```
dockerised-shiny/
├── Dockerfile
├── myapp
│ └── app.R
├── README.md
├── shiny-server.conf
└── shiny-server.sh
```

This should be adapted as required.

```
# Using rocker/r-ver:version, update version as appropriate
FROM rocker/r-ver:3.5.0

# install dependencies
RUN apt-get update && apt-get install -y \
    sudo \
    gdebi-core \
    pandoc \
    pandoc-citeproc \
    libcurl4-gnutls-dev \
    libcairo2-dev \
    libxt-dev \
    libxml2-dev \
    libssl-dev \
    wget

# Download and install shiny server
RUN wget --no-verbose https://download3.rstudio.org/ubuntu-14.04/x86_64/VERSION -O "version.txt" && \
    VERSION=$(cat version.txt) && \
    wget --no-verbose "https://download3.rstudio.org/ubuntu-14.04/x86_64/shiny-server-$VERSION-amd64.deb" -O ss-latest.deb && \
    gdebi -n ss-latest.deb && \
    rm -f version.txt ss-latest.deb && \
    . /etc/environment && \
    R -e "install.packages(c('shiny', 'rmarkdown'), repos='$MRAN')" && \
    cp -R /usr/local/lib/R/site-library/shiny/examples/* /srv/shiny-server/

# Copy configuration files into the Docker image
COPY shiny-server.conf /etc/shiny-server/shiny-server.conf
COPY shiny-server.sh /usr/bin/shiny-server.sh

# Copy shiny app to Docker image
COPY /myapp /srv/shiny-server/myapp

# Expose desired port
EXPOSE 80

CMD ["/usr/bin/shiny-server.sh"]
```

To build the Docker image (called `myapp`):

`docker build -t myapp .`

To run a container based on our Docker image:

This will run the Docker image ‘myapp’ in a container (in detached mode) and expose port 80. It will name it ‘myapp’ and remove it when exited.

`docker run --rm -p 80:80 --name myapp -d myapp`

http://127.0.0.1/myapp/

`docker images `

`docker ps -a`

For individual containers add the container ID

`$ docker rm CONTAINER_ID`

To remove all exited containers :

`$ docker rm $(docker ps -a -q -f status=exited)`

Remove all unused containers, networks, images (both dangling and unreferenced), and optionally, volumes.

`docker system prune -a`

`docker save -o ~/myapp.tar myapp`

```
docker load -i myapp.tar
docker run myapp
```

https://github.com/deanmarchiori/dockerised-shiny

https://hub.docker.com/r/rocker/shiny

https://www.docker.com/get-started

https://www.bjoern-hartmann.de/post/learn-how-to-dockerize-a-shinyapp-in-7-steps/