Explaining Artificial Intelligence (AI) Solutions
Before turning to this topic, I should state how I define AI in the context of this article: AI is the use of neural net technology, whose more recent developments have come in the area of deep learning. Now let’s begin the discussion.
Experienced data scientists who have spent years applying solutions to specific business problems do not rely on the “performance” of the solution alone as the final deliverable. An equally important deliverable is storytelling, a term that is becoming common in the analytics vernacular. What does the solution mean, and what are the key elements that generated it? Simply trusting the “black box” nature of AI is becoming increasingly unacceptable. In fact, pending European Union legislation due in May 2018 purports to improve the transparency of AI solutions, with fines accruing to organizations that are not in compliance with the law. See https://www.nytimes.com/2017/11/21/magazine/can-ai-be-taught-to-explain-itself.html
Yet the challenge of explainability is not necessarily new for the data scientist. A large part of our work is presenting reports and approaches that best “tell the story”. This is a simple process when one is using decision trees or regression techniques. Decision trees draw on various forms of advanced mathematics such as CHAID, CART, and information gain based on entropy functions. But the end result is a set of branches and nodes, with the end nodes representing the “final” model. See the example below of a response model.
In our example above, the solution is easily explained as comprising three key variables: tenure has the strongest impact on response, followed by income and age. The final solution itself comprises the four end nodes, ranked as follows by response rate.
1) > 2 years tenure and > 40 years old
2) > 2 years tenure and < 40 years old
3) < 2 years tenure and income < 50K
4) < 2 years tenure and income > 50K
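The four end nodes can be read as a small set of rules. The following is a minimal sketch of that idea, not the actual model: the function name `end_node` is hypothetical, and the handling of customers sitting exactly on a split point (2 years, 40 years, 50K) is an assumption, since the ranking above does not say which side they fall on.

```python
def end_node(tenure_years: float, age: float, income: float) -> int:
    """Return the rank (1 = highest response rate) of a customer's end node.

    Assumption: values exactly on a split point (2 years, 40 years, 50K)
    go to the lower-tenure, younger, or higher-income branch respectively.
    """
    if tenure_years > 2:
        # Long-tenure customers are split by age.
        return 1 if age > 40 else 2
    # Short-tenure customers are split by income.
    return 3 if income < 50_000 else 4


# Illustrative customers (made-up values, one per end node).
customers = [(5, 55, 30_000), (3, 28, 80_000), (1, 45, 40_000), (1, 45, 60_000)]
for tenure, age, income in customers:
    print(tenure, age, income, "-> end node", end_node(tenure, age, income))
```

Because each customer lands in exactly one end node, the explanation of the model is simply the path of conditions that leads to that node.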
Using a more parametric approach such as regression, we would use logistic regression if the target variable is binary or multiple regression if the target variable is continuous. But rather than report the equation itself, we create a report that shows the % contribution of each variable within the model as well as its impact (negative or positive) on the target variable. The % contribution of each variable is calculated by determining the partial R² of each independent variable and then dividing by the total R² of the entire model. See the example below of a simple response model for a loyalty card.
Table 2
Model Variable | Impact on Response | % Contribution to Model
--- | --- | ---
Income | Positive | 45%
Tenure | Negative | 25%
Product Spend | Positive | 20%
Gender is Male | Negative | 10%
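A report of this kind could be produced along the following lines. This is a sketch under stated assumptions, not the exact procedure behind Table 2: it reads "partial R²" as the gain in R² when each variable enters the model in a fixed order, it uses ordinary least squares on a continuous target, and the data, the variable names (`var_a`, `var_b`, `var_c`) and the function names are all made up for illustration.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X (intercept added)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())


def contribution_report(X, y, names):
    """Per variable: coefficient sign and share of the total model R^2."""
    total_r2 = r_squared(X, y)
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # full-model coefficients
    report, prev_r2 = [], 0.0
    for j, name in enumerate(names):
        r2_j = r_squared(X[:, : j + 1], y)        # model with first j+1 variables
        share = (r2_j - prev_r2) / total_r2       # incremental R^2 / total R^2
        sign = "Positive" if coef[j + 1] > 0 else "Negative"
        report.append((name, sign, share))
        prev_r2 = r2_j
    return report


# Synthetic data purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=500)

report = contribution_report(X, y, ["var_a", "var_b", "var_c"])
for name, sign, share in report:
    print(f"{name}: {sign}, {share:.0%}")
```

With this reading the shares always add up to 100%, but they depend on the order in which variables enter when the inputs are correlated, so the entry order is itself a modelling choice.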
In the Table 2 example, income is the strongest variable within the model: higher income means a higher likelihood of acquiring a loyalty card. The weakest variable is gender, which indicates that males are less likely to acquire a loyalty card. The direction of each relationship to response is determined by the sign of the coefficient for that variable. As you can see, the table clearly tells a story of who is most likely to acquire a loyalty card. But let’s take this concept to AI or neural networks, where we have input layers, hidden layers, and output layers, as well as nodes for each layer, all displayed as a network. With more nodes and layers, the network becomes increasingly complex. See the example below of a neural net architecture.
The use of AI in the form of deep learning can involve complex relationships between variables, where the output is not as easily explained as it is with decision trees or regression analysis. There is some literature, albeit academic, that has attempted to provide potential solutions in this area. One such solution is to conduct a regression analysis of the input variables against a dependent variable that is the actual neural net output score. We can then use the partial R² technique demonstrated in Table 2, together with the sign of the regression coefficient, to determine the % contribution of each variable within the model as well as its relationship to the dependent or target variable. But what is the fundamental flaw here? Since we are using regression techniques, we are ultimately assuming linearity between the input variables and the output scores. Yet the power of neural net technology is its ability to build potentially stronger-performing solutions without any assumption, such as linearity, about the distribution of the data. Using a linear tool to explain a non-linear solution seems to be the ultimate contradiction.
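The contradiction can be seen in a toy case. In the sketch below, `net_score` is a hypothetical stand-in for a trained neural net's output (a real network is not being fitted here), and the inputs are placed on a symmetric grid so that the result does not depend on random sampling.

```python
import numpy as np


def net_score(x0, x1):
    """Hypothetical stand-in for a neural net score: non-linear in x0."""
    return x0 ** 2 + x1


# Inputs on a symmetric grid (an illustrative assumption, not real data).
grid = np.linspace(-2.0, 2.0, 41)
x0, x1 = (a.ravel() for a in np.meshgrid(grid, grid))
scores = net_score(x0, x1)

# Linear surrogate: regress the score on the inputs, as described above.
A = np.column_stack([np.ones_like(x0), x0, x1])
coef, *_ = np.linalg.lstsq(A, scores, rcond=None)
fitted = A @ coef
r2 = 1.0 - ((scores - fitted) ** 2).sum() / ((scores - scores.mean()) ** 2).sum()

print("coefficient on x0:", round(float(coef[1]), 6))
print("coefficient on x1:", round(float(coef[2]), 6))
print("surrogate R^2:", round(float(r2), 3))
```

The score depends strongly on x0, yet the linear surrogate assigns it a coefficient of essentially zero because the relationship is symmetric rather than linear. A contribution report built on this surrogate would therefore say that x0 does not matter.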
Another option that merits consideration is a methodology proposed by G.D. Garson (Garson, G.D. 1991. Interpreting neural network connection weights. Artificial Intelligence Expert. 6(4):46-51). In a nutshell, the methodology combines the weights on the input variables with the weights on the output score. Let’s delve into this a little more closely.
In step 1, the input weight of each variable within each node in the first table is multiplied by that node's output weight. For example, the input weight for variable A in node 1 is -5.513, which is multiplied by its output weight of .665; the absolute value of the product, 3.663, is the value in the upper left-hand corner of the second table (variable A, node 1). The same calculation is then done for each cell in the second table.
Step 2 determines the % importance of each variable within each node. This is done by dividing the cell value in a given row by the sum of all the values within that row of the second table. For example, variable A’s importance in node 1 is 3.663/(3.663+1.444+1.454+2.281) = .414, which is the value in the upper left-hand corner of table 3 (variable A, node 1). The same calculation is then done for each remaining cell in table 3.
Meanwhile, step 3 defines the % importance of each variable overall. In table 3, we sum all the node values for a given variable. For example, the node total for variable A is (.414+.217+.061+.360) = 1.051. This calculation is done for each variable in table 3, with the resulting values shown as column totals. These column totals are then transferred to table 4, where we calculate the % contribution of a given variable as its column total divided by the grand total. The grand total is 4 because each of the four node rows in table 3 sums to 1. For example, variable A’s overall contribution in table 4 would be 1.051/4, which is 26.3%. In the example here, the largest contributor to the neural net model would be variable C, with a contribution of 30%.
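The three steps above can be sketched in a few lines of NumPy for a network with a single hidden layer and one output. This is an illustrative reading of the walk-through, not a reference implementation of Garson's paper: the function name `garson_importance` is hypothetical, the weights are made up rather than taken from the tables, and rows are nodes while columns are variables, to match the layout described in the text.

```python
import numpy as np


def garson_importance(input_weights, output_weights):
    """Variable importance from the weights of a one-hidden-layer network.

    input_weights:  shape (n_nodes, n_variables), input-to-hidden weights
    output_weights: shape (n_nodes,), hidden-to-output weights
    """
    W = np.asarray(input_weights, dtype=float)
    v = np.asarray(output_weights, dtype=float)
    # Step 1: absolute product of each input weight and its node's output weight.
    products = np.abs(W * v[:, None])
    # Step 2: share of each variable within each node (every row sums to 1).
    shares = products / products.sum(axis=1, keepdims=True)
    # Step 3: sum the shares per variable, then divide by the grand total.
    totals = shares.sum(axis=0)
    return shares, totals / totals.sum()


# Made-up weights for 4 hidden nodes and 4 input variables (A, B, C, D).
W = [[-1.2, 0.4, 0.9, -0.3],
     [0.7, -1.5, 0.2, 0.8],
     [0.1, 0.6, -2.0, 0.5],
     [1.1, -0.2, 0.3, -1.4]]
v = [0.6, -0.9, 1.3, 0.4]

shares, importance = garson_importance(W, v)
for name, value in zip("ABCD", importance):
    print(f"variable {name}: {value:.1%}")
```

As a check against the text, feeding the node 1 products quoted in step 2 (3.663, 1.444, 1.454, 2.281) in as a single row with an output weight of 1 gives a share of .414 for variable A. Note that taking absolute values discards the direction of each effect, so this approach reports importance only, not whether a variable's impact is positive or negative.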
As indicated above, this methodology is intriguing but was developed many years ago. In light of the very recent and exciting developments in deep learning, we are currently exploring whether the approach is still valid today. Specifically, more work needs to be done on adapting this way of evaluating variable weights to a much more complex environment. With the huge investment of resources and dollars in AI, solutions will become more mainstream. Yet the need to understand the solution is even more pressing today. A simple “trust the black box” approach does not allow an organization to understand why a given solution worked or did not work. The complexity of AI or deep learning solutions will always be a barrier to business understanding. Having said that, perhaps future funding and resources within AI might be devoted to this area of “business understanding”. This would allow businesses to stay within their comfort zone and be more confident about a given business solution.
Richard Boire is Senior Vice President at Environics Analytics.