Machine learning is not far off or far out. It is a practical business tool within the reach of any enterprise.
It is important to start by understanding how machine learning fits into the broader realm of analytics.
I often describe analytics as having four kinds of sight (no four-eyes jokes, please; I wear glasses):
- Hindsight – Report Business Results
- Oversight – Ensure Regulatory Compliance
- Insight – Discover Affinities, Correlations, and Causalities
- Foresight – Predict Future Behaviors and Events
Machine learning is about foresight. It is a prediction tool.
How Machine Learning Works
Although there are many different methods and subdisciplines, most forms of machine learning involve the same basic process. The machine or, more precisely, the learning software algorithm, examines data about past cases with known outcomes (called “training data”) and then builds a model that can be used to predict future outcomes.
Before the model is used, it is tested against a reserved set of additional known cases to determine its accuracy. For each test case, the model calculates a predicted outcome, either a true/false value (classification) or a numerical value (regression), and that prediction is compared to the known outcome. The mismatches across all test cases give an error rate. Getting an acceptable error rate often requires tuning the algorithm by changing various parameters, rebuilding the model, and retesting it.
When the error rate is in bounds, the model can be used to evaluate new data. This is often referred to as “scoring.” New, unknown cases are submitted to the model and it returns a prediction for each case.
Spam filtering is a common example of this process. The learning machine is given identified examples of spam and non-spam messages and produces a model that is used by a spam filtering program to classify new messages as they come in.
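To make the train, test, and score steps concrete, here is a minimal sketch of a spam filter in plain Python. It is only an illustration: I have assumed a naive Bayes word-count model, which is one common choice and not necessarily what any given product uses, and the function names and the handful of messages are made up for the example.

```python
import math
from collections import Counter

LABELS = ("spam", "ham")  # "ham" is the usual nickname for non-spam


def tokenize(text):
    # Deliberately crude: lowercase and split on whitespace.
    return text.lower().split()


def train(examples):
    """Build a model from (text, label) pairs with known outcomes."""
    words = {label: Counter() for label in LABELS}
    labels = Counter()
    for text, label in examples:
        labels[label] += 1
        words[label].update(tokenize(text))
    vocab = set(words["spam"]) | set(words["ham"])
    return {"words": words, "labels": labels, "vocab": vocab}


def score(model, text):
    """Return the predicted label for one new, unknown message."""
    total = sum(model["labels"].values())
    best_label, best_logp = None, -math.inf
    for label in LABELS:
        logp = math.log(model["labels"][label] / total)
        n_words = sum(model["words"][label].values())
        denom = n_words + len(model["vocab"]) + 1
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            logp += math.log((model["words"][label][word] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


def error_rate(model, test_cases):
    """Fraction of reserved known cases the model gets wrong."""
    wrong = sum(1 for text, label in test_cases if score(model, text) != label)
    return wrong / len(test_cases)


# Made-up training data: past cases with known outcomes.
training_data = [
    ("win cash prize now", "spam"),
    ("free prize claim now", "spam"),
    ("cheap loans win big", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the team tomorrow", "ham"),
    ("project status report attached", "ham"),
]
# Made-up reserved cases, held back from training for testing.
holdout = [
    ("claim your free cash", "spam"),
    ("status of the monday meeting", "ham"),
]

model = train(training_data)
print("holdout error rate:", error_rate(model, holdout))
print("new message scored as:", score(model, "win a free prize"))
```

A real filter would train on many thousands of messages and would be retuned whenever the holdout error rate drifted out of bounds, but the shape of the process is the same.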
To Date: Pricey but Real
Despite what many businesspeople might guess, machine learning is not in its infancy. It has come to be used very effectively across a wide array of applications, but business applications have been in a small minority. Furthermore, the conventional business applications of machine learning largely have been the province of well-heeled stock traders, credit card issuers, and so on. Until very recently, machine learning was quite costly, which kept it out of the hands of many.
Machine learning is computationally intensive, requiring large amounts of computer memory, processing power, and data storage. Commercial-grade machine learning software is quite expensive and requires highly skilled data scientists to operate it. And finally, the machine learning process can be highly iterative, sometimes requiring many hours of trial-and-error guesswork by those expensive data scientists using those expensive computers.
In short, despite its great promise, machine learning has been a high-cost, low-productivity affair, not for the hoi polloi.
Machine Learning for Everyman
But that is changing very quickly. A combination of forces is altering the economics of machine learning and making it accessible and affordable even for very small companies.
- Cloud computing brings virtually unlimited horsepower on demand for a few cents per node/minute.
- Software as a Service enables access to powerful software for less than a dollar per node/hour.
- Self-optimizing machine learning algorithms greatly reduce machine and data scientist time.
- Big data, from social, sensors, and clicks, creates new analytical and predictive opportunities.
Big data machine learning in the cloud can produce powerful predictive models very quickly and at a very low cost. You still need data scientists to work with stakeholders to provision the right training data and learn the right things from it, but they will be more productive and able to focus on data science rather than busy work and guesswork.
The good news is you don’t need data scientists to get started with machine learning and predictive modeling. There are a number of things you should do and learn before assessing what kind of data science skills you need. You will have an easier time of it if you focus on the following prediction preliminaries.
Assess Your Assets
You can’t build a model without data, preferably lots of data. Before you go too far down any particular predictive path, you should assess your data assets. That is, business and IT folks should collaborate to find out what data might and might not be available to learn from.
There could be regulatory issues that will force complex and costly “data denaturing.” The data you can access may be missing important pieces, or you may not own the data you need most. The data you want might be expensive or difficult to get, as with, say, the Twitter firehose, or consumer geo-location data.
Start with data that you have lots of and know well, like call center records, historical research data, machine and sensor logs, or whatever is close to the data heartbeat of your business. Data doesn’t have to be costly, complicated, or valuable on its face to have predictive power.
Also, low-cost commercial and public databases like the US Bureau of Labor Statistics business census data can be used in conjunction with some of the data you already have to create useful training data. Open data initiatives from around the US and around the world are making more data more available all the time.
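Combining your own records with a public data set usually comes down to a join on a shared key. The sketch below shows the idea in plain Python. The field names (county_code, calls_last_year, median_wage) and every value in it are placeholders I invented for illustration, not real figures from any agency.

```python
def build_training_rows(internal_rows, public_rows, key):
    """Attach public reference fields to internal records that share a key.

    Returns (joined, unmatched) so you can see how much of your own
    data the public source actually covers.
    """
    lookup = {row[key]: row for row in public_rows}
    joined, unmatched = [], []
    for row in internal_rows:
        extra = lookup.get(row[key])
        if extra is None:
            unmatched.append(row)
            continue
        merged = dict(extra)
        merged.update(row)  # your own values win if field names clash
        joined.append(merged)
    return joined, unmatched


# Placeholder internal records (hypothetical call center summary).
internal = [
    {"county_code": "A1", "calls_last_year": 120, "churned": True},
    {"county_code": "B2", "calls_last_year": 15, "churned": False},
    {"county_code": "Z9", "calls_last_year": 40, "churned": False},
]
# Placeholder public reference data keyed by the same county code.
public = [
    {"county_code": "A1", "median_wage": 1000},
    {"county_code": "B2", "median_wage": 2000},
]

joined, unmatched = build_training_rows(internal, public, "county_code")
print(len(joined), "rows ready for training;", len(unmatched), "without a match")
```

The unmatched list matters as much as the joined one: if a large share of your records have no counterpart in the public source, it will not add much predictive power.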
Find the Known Unknowns
Once you have a better idea of what interesting data you can easily access, you can then consider what it might help you predict. Focus on events, behaviors, and classifications: what will a customer do, when will something happen, and which thing is not like the others?
Your asset assessment may have turned up copious data about customers’ service and purchase histories that could help you accurately predict customer behavior in response to product changes. You may have easy access to inventory and purchasing data that could enable you to predict supplier behavior in relation to commodity price fluctuations.
Like the asset assessment, this step also benefits greatly from collaboration between the people who know the data and the people who know the business (and, of course, the data scientists, if you have them).
Prep for Prepping
Always treat your data architects as well as you treat your data scientists. The process of turning business transaction and reference information into the kind of data that the machine learning algorithms can understand and use is critical and it starts with data architecture. That is the domain of data formats, structures, locations, policies, relationships, and semantics.
Business data, especially when collected over long periods of time by a succession of systems and applications, can be very messy in ways that can impede machine learning. Missing values, columns with semantically inconsistent content, ambiguous use of nulls, and more—they’re all in there. The existing systems don’t gag on those things that often because they have been carefully instrumented with appropriate mechanisms for handling exceptions and avoiding problems. So, even in tight systems, the data can be quite loose, and for machine learning, you want tight data with everything true and in place.
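A quick way to see how loose a column is: count its missing values, its ambiguous stand-ins for null, and the mix of value types it contains. The following is a rough sketch in plain Python. The list of null-like markers and the sample rows are my own assumptions; your data architect will know which markers your systems really use.

```python
from collections import Counter

# Assumed set of strings that often stand in for "no value".
NULL_LIKE = {"", "n/a", "na", "null", "none", "-", "unknown"}


def profile_column(rows, column):
    """Summarize one column across a list of dict records."""
    report = {"total": 0, "missing": 0, "null_like": 0, "types": Counter()}
    for row in rows:
        report["total"] += 1
        value = row.get(column)  # an absent key counts as missing
        if value is None:
            report["missing"] += 1
        elif isinstance(value, str) and value.strip().lower() in NULL_LIKE:
            report["null_like"] += 1
        else:
            # More than one type here hints at semantically mixed content.
            report["types"][type(value).__name__] += 1
    return report


# Made-up records showing the kinds of mess described above.
sample_rows = [
    {"age": 34},
    {"age": "N/A"},
    {"age": None},
    {"age": "forty"},
    {},
]

report = profile_column(sample_rows, "age")
print(report)
```

Here a column that should hold numbers turns out to contain one usable integer, one free-text value, one placeholder string, and two missing entries, which is exactly the sort of thing to fix before training.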
A knowledgeable data architect is essential to preparing data for machine learning. He or she is the one who can map the swamps and high points in the data landscape and help design some of the fixes you will need in order to complete the data preparation.
Consultants and Cloud
Now you are ready for machine learning. You will need a data scientist or two, some software for them to use, and some hardware to run it on. For this I have two words: consultants and cloud.
For any given predictive model, once the training data is available and the machine learning software is tuned up properly, the rest can be quite mechanical and programmatic: refreshing the model with new training data and using the model to score new observations. You may only need a data scientist at certain times. There are many great freelance data scientists to whom you may be able to outsource training data and new development. This is especially true if you will be modeling things that do not change that often or that quickly.
Visit data science sites like http://www.datasciencecentral.com/, http://www.kdnuggets.com/, http://www.analyticbridge.com/, and http://www.datanami.com/ and you will find the crossroads of data science—discussions, news, blogs, jobs, ads, tutorials, and other key resources. Whether you want to buy or rent a data scientist, or just get a better idea of what data scientists do, those are great places to look.
And, speaking of renting rather than buying, cloud computing lets you pay for just the resources you need, just for the time you need them. For less than the hourly cost of a minimum-wage worker, you can probably get a configuration large enough to process your largest possible training data sets.
Furthermore, if the machine learning software you use is built expressly for distributed parallel systems and delivered with usage-based pricing, it can be both highly efficient and very affordable, again with no up-front cost.
Machine learning and the predictive modeling it enables are coming within reach of most medium to large businesses thanks to the economics of cloud computing, advances in scalable software, and the growing availability and variety of big data.
Many companies exploring predictive modeling are eager to hire data scientists, thinking that they can hand the problem to them and get on with it, but those hires end up sitting around because the company is not ready for them.
They should start with the data, not with the scientists. They should fully tap the deep data knowledge of data architects and others in IT to uncover their most valuable training data and get it ready for this new use.
Next, they should move to finding the business events that the data might be used to predict, such as customer churn, overtime expense, or optimum stocking levels.
Once the prediction goals are defined, it will probably be necessary to look for source data voids and flaws that can impact the speed or accuracy of the model and take some corrective measures to tighten up the data.
The best thing about machine learning today is that you can do it without buying computer hardware or software, or hiring expensive data scientists. You can rent it all, try things out, and see what works for you.
Tim Negris is VP of Marketing and Sales at Yottamine Analytics, a provider of high performance, cloud-based software services that enable data scientists to rapidly build highly accurate predictive models on-demand, with less trial and error, and at the lowest possible cost.