With much of the current discussion focused on the latest techniques in machine learning, and in particular deep learning, the significant benefits of both are now a public reality. Yet machine learning in effect represents the predictive analytics techniques that data scientists have used for many years, and data scientists and their end users have always recognized the huge economic advantages of predictive analytics. The significant advances of deep learning over the last five years have simply extended the application of predictive analytics to areas that were not technically feasible before. The market for these solutions is huge and the competition fierce, which has led companies to provide automated solutions that incorporate all the latest machine learning technologies. In today's environment, speed and ease of use are the critical requirements for any machine learning software company.
But as I have stated in previous articles, what happens in this new paradigm to the data scientist, who is typically equipped with very advanced mathematical and statistical skills? Do their skills become redundant in an analytics environment that is becoming increasingly automated? Will there be a reduction in demand for data scientists? Rather than diminishing, the demand for data scientists will grow, but with a refocus on other skills that relate to identifying the problem, creating the analytical file, and architecting the data. This refocus will in effect emphasize more of the "art" in data science, which is certainly not a new phenomenon for many data scientists. But what do I mean by this?
Thinking of our brains as "left" brain vs. "right" brain helps to better understand the role of a data scientist. For example, a person who tends to be more "left-brained" is often said to be more logical, analytical, and objective. In other words, the current emphasis on programming and mathematics skills would seem to be more left-brain oriented. Meanwhile, a person who is "right-brained" is said to be more intuitive, thoughtful, and subjective. In other words, these individuals appear to bring a more creative bent, or a greater "art" component, to solving a given business problem.
In a world of increasing automation as indicated above, the skillset of the data scientist will evolve as demand shifts from the more technical requirements to the so-called softer skills of applying data science knowledge to business problems. In this evolution, the right brain of the data scientist will be emphasized, as data scientists need to exercise more of their creative skills in applying their knowledge to a myriad of business problems. The data scientist will still need a deep understanding of the technical side, but with more emphasis on understanding output than on generating it.
Today we are observing the growth of these so-called hybrids, who are well versed in the more technical aspects of data science but who also demonstrate strong capabilities in the "softer" business skills, the "art" side of data science. The demand for these hybrids will continue to accelerate as expectations rise for solving more business problems in an increasingly automated environment.
To provide some perspective on what this really means in practice, let me highlight a few examples of how the "right" side of the brain, the "art" component, is used within the data science discipline.
Even in the first stage of data science, identifying the business problem, the data scientist's creativity is used to define that problem more sharply. For example, the business team may identify the need for a predictive model to identify those customers who are most at risk of defection. Yet the data scientist understands that over 50% of the customer base is inactive. He or she might then suggest that the real problem is to identify high-risk defectors who are also high value. The problem might also be framed around how marketing can derive the most impact from a retention program; in other words, the marketing team wants to focus its efforts on saving these high-risk, high-value customers. In this case, a simple retention model is no longer sufficient, since a net-lift (uplift) model can target those high-risk defectors who are actually likely to be saved by a marketing campaign.
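To make the net-lift idea concrete, here is a minimal sketch using scikit-learn on synthetic data; the features, effect sizes, and the 100-customer target count are illustrative assumptions, not figures from this article. It follows the simple two-model ("T-learner") approach: fit separate churn models on treated and control customers, then rank everyone by the estimated drop in churn probability if they receive the retention offer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic example: 1,000 customers, two features (e.g. tenure, spend),
# a treatment flag (received a retention offer) and a churn outcome.
n = 1000
X = rng.normal(size=(n, 2))
treated = rng.integers(0, 2, size=n).astype(bool)
# Churn probability is lower for treated customers (illustrative effect only).
p_churn = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.6 * treated)))
churn = rng.random(n) < p_churn

# Two-model net-lift approach: fit separate churn models on the
# treated and control groups, then score every customer with both.
m_treat = LogisticRegression().fit(X[treated], churn[treated])
m_ctrl = LogisticRegression().fit(X[~treated], churn[~treated])

# Uplift = estimated reduction in churn probability if we intervene.
uplift = m_ctrl.predict_proba(X)[:, 1] - m_treat.predict_proba(X)[:, 1]

# Target the campaign at the customers with the largest estimated uplift.
top_targets = np.argsort(uplift)[::-1][:100]
```

The point of the two-model design is exactly the reframing above: it scores who is *savable by the campaign*, not merely who is likely to defect.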
We have now defined the problem of retention, but let's continue to explore the modelling of retention, where the "art" or right side of the brain continues to be used, now in the creation of the analytical file. One might think that once the problem is defined, the technical work of programming the data to create the analytical file is all that remains, and the technical and programming side is indeed a critical component of this phase. But one key requirement in creating the analytical file is the construction of the target variable for retention. How would one program the target variable of customer retention for a grocer versus a credit card company? Unlike response models, where the data scientist can code the target variable of response directly from certain data fields, no single field in the raw source data specifically defines retention. Instead, the data scientist needs to be proactive in devising an approach that captures retention behaviour. This approach draws on the strength of the data scientist's analytical skills as well as domain knowledge of retention. That domain knowledge emphasizes that retention is all about purchase behaviour: to define retention, one needs to understand the typical purchase period, which depends on the business and, of course, the industry. For example, the average purchase period for a customer buying groceries is very different from that of a customer using a credit card; the relevant window might be one week for groceries but three months for credit card usage. In both cases, an analytical approach is established to help determine the appropriate time periods.
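As a hedged illustration of constructing such a target variable, the sketch below derives an inactivity window from the distribution of inter-purchase gaps and then labels customers as retained or not. The tiny transaction log, the 90th-percentile choice, and the observation date are all hypothetical assumptions made for the example, not a prescription.

```python
import pandas as pd

# Hypothetical transaction log: one row per customer purchase.
txns = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "purchase_date": pd.to_datetime([
        "2023-01-02", "2023-01-09", "2023-01-16",   # weekly shopper
        "2023-01-05", "2023-03-20",                 # long gap
        "2023-01-10",                               # single purchase
    ]),
})

# Derive the "typical purchase period" from the data itself: here, the
# 90th percentile of gaps between consecutive purchases per customer.
gaps = (txns.sort_values("purchase_date")
            .groupby("customer_id")["purchase_date"]
            .diff().dropna())
window = gaps.quantile(0.90)  # data-driven inactivity threshold

# Label a customer "retained" (1) if the time since their last purchase,
# measured at the observation date, falls within the window.
obs_date = pd.Timestamp("2023-04-01")
last_seen = txns.groupby("customer_id")["purchase_date"].max()
retained = ((obs_date - last_seen) <= window).astype(int)
```

For a grocer the derived window would land near a week; for a credit card issuer, with sparser transactions, the same code would yield a window of months, which is precisely why the definition must come from the data rather than from any single field.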
The development of fraud models is similar to the development of retention models in that we need to utilize both domain knowledge and the appropriate analytical approach. The same definitional challenge exists for fraud as for retention: no specific data field pertaining to fraud exists on any database. Instead, the data scientist has to explore the data for patterns and insights that appear fraudulent, and of course this differs from industry to industry. Assessing fraudulent behaviour in insurance is very different from assessing fraudulent credit card activity. But again, the "right" side of the brain is used to arrive at an analytical process that identifies what "fraudulent" behaviour is. Once this behaviour is identified, we then use the data to build models that predict its likelihood. In both fraud and retention, one could say that the behaviours being identified are quasi or pseudo measures rather than direct measures sourced from the database. This use of quasi or pseudo measures as target variables is often the norm rather than the exception in building predictive models, and the scenario is growing as we face ever more business problems even with access to more data.
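One way the pattern-hunting step could be sketched: use an off-the-shelf anomaly detector to flag outlying transactions as a pseudo-fraud label, which a supervised model could then learn to predict. The data, the two features, and the 2% contamination rate below are invented for illustration and are not presented as the author's method.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Hypothetical card transactions: amount and hour of day. Most activity
# is routine; a small cluster is unusually large and happens late at night.
normal = np.column_stack([rng.normal(60, 20, 980),   # typical amounts
                          rng.normal(14, 3, 980)])   # daytime hours
odd = np.column_stack([rng.normal(900, 100, 20),     # very large amounts
                       rng.normal(3, 1, 20)])        # around 3 a.m.
X = np.vstack([normal, odd])

# No "fraud" field exists, so we construct a quasi-label: transactions
# the anomaly detector scores as outliers become our pseudo-fraud target.
iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
pseudo_fraud = (iso.predict(X) == -1).astype(int)  # 1 = flagged

# This pseudo target can then feed a supervised model that predicts
# the likelihood of the flagged behaviour on new transactions.
```

The contamination rate is itself a domain-knowledge judgement: in insurance versus credit cards, both the features and the plausible outlier fraction would be chosen very differently.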
The demand for this "right-brain" thinking is increasing, and in an era of growing automation, the call from business for the "art" of data science will only grow louder.
Author: Richard Boire, Senior Vice President at Environics Analytics