{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# BAYa class Assignment 2020\n", "\n", "In this assignment, your task will be to implement and analyze algorithms for inference in probabilistic\n", "models described by factor graphs. In particular, you will implement Sum-Product algorithm (or Belief Propagation)\n", "and Max-Product (Max-Sum) algorithm to infer different probabilities, (marginal) distributions and values of variables for general factor\n", "graphs with categorical random variables as described in the slides on [Graphical Models and Inference](http://www.fit.vutbr.cz/study/courses/BAYa/public/prednasky/2-Graphical%20Models.pdf).\n", "More detailed information can be found in Chapter 8.4 in [Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf)\n", "\n", "The preferred and easiest way of accomplishing this task is to complete this Jupyter Notebook, which already comes\n", "with a definition of the probabilistic model that you will work with.\n", "If you do not have any experience with Jupyter Notebook, the easiest way to start is to install Anaconda3,\n", "run Jupyter Notebook and open this notebook downloaded from [BAYa_Assignment2020.ipynb]\n", "(http://www.fit.vutbr.cz/study/courses/BAYa/public/notebooks/BAYa_Assignment2020.ipynb).\n", "\n", "The following cell contains a code with the definition of the probabilistic model for which you will run the inference algorithms. You should not edit this part!" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "#Names of random variables in our model\n", "var_names=['State', 'Age', 'COVID', 'Party', 'Favorite', 'Voting']\n", " \n", "#Names of possible values (categories) for each random variables.\n", "val_names=[['Texas', 'California', 'NewYork'], #State\n", " ['Young', 'Old'], #Age\n", " ['Negative', 'Positive'], #COVID\n", " ['Democrats', 'Republicans'], #Party\n", " ['Biden', 'Trump'], #Favorite\n", " ['InPerson', 'Postal', 'NotVoting']]#Voting\n", "\n", "#The above variables with names are introduced just to make up a story around our model. Use these names\n", "# to interpret your results, but, in your 'generic' inference algirithms, avoid the use of these variables\n", "# and use only the definitions below!\n", "\n", "#Number of categories for each random variables.\n", "ncategories = np.array([len(v) for v in val_names])\n", "\n", "class Factor:\n", " \"\"\"\n", " Instances of this class represent individual factors in a factor graph.\n", " It is merely a structure of two member variables:\n", " 'vars': list of N integers, which are IDs of N random variables that this\n", " factor depends on. These integers can be seen as indices into 'var_names'\n", " 'table': potential function of the factor. Since we deal only with categorical\n", " variables, the potential function has form of N-dimensional array.\n", " The first dimension corresponds to the first variable in 'vars', the second\n", " dimension to the second variable, etc. The size of each dimension is given\n", " by the number of possible values of the corresponding variable\n", " \"\"\"\n", " def __init__(self, list_of_variable, potential_function_table):\n", " self.vars = list(list_of_variable)\n", " self.table = np.array(potential_function_table)\n", " \n", " # the number of table dimensions and the number of variables must match\n", " assert(self.table.ndim == len(self.vars))\n", " # the individual dimensions must match with the number of cathegories of the corresponding variable\n", " assert(np.all(ncategories[self.vars]==self.table.shape))\n", "\n", "\"List of factors defining our complete probabilistic model\"\n", "factors = [\n", "# P(State)\n", " Factor([0], [0.3, # Texas\n", " 0.5, # California\n", " 0.2]), # NewYork\n", "\n", "# P(Age)\n", " Factor([1], [0.6, # Young\n", " 0.4]), # Old\n", "\n", "# P(COVID)\n", " Factor([2], [0.7, # Negative \n", " 0.3]), # Positive\n", "\n", "# Texas California NewYork\n", "# P(Party|State,Age) Young,Old Young,Old Young.Old\n", " Factor([3, 0, 1], [[[0.4, 0.2], [0.9, 0.8], [0.8, 0.6]], # Democrats \n", " [[0.6, 0.8], [0.1, 0.2], [0.2, 0.4]]]),# Republican\n", "\n", "# P(Favorite|Party) Dem. Rep.\n", " Factor([4, 3], [[0.95, 0.2], # Biden\n", " [0.05, 0.8]]),# Trump\n", " \n", "# Democrats Republicans\n", "# P(Voting|Party,COVID) Neg. Pos. Neg. Pos. \n", " Factor([5, 3, 2], [[[0.5, 0.0], [0.7, 0.1]], # InPerson \n", " [[0.4, 0.9], [0.1, 0.4]], # Postal\n", " [[0.1, 0.1], [0.2, 0.5]]])# None\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above cell, we first introduce names of our random variables and their values to make up some \"story\" around our probabilistic model. Our model \"pretends\" to model the behaviour of a random US citizen during the recent US presidental elections. It contains the following random variables and their categorical values:\n", "\n", "* State: Does the random person come from 'Texas', 'California' or 'NewYork'? For simplicity, we pretend that USA does not have more states. \n", "* Age: Is the person 'Young' or 'Old'?\n", "* COVID: Is the person COVID 'Positive' or 'Negative'?\n", "* Party: Is the person supporter of 'Democrats' or 'Republicans'?\n", "* Favorite: Is the person's prefered president candidate 'Biden' or 'Trump'?\n", "* Voting: Is the person voting 'InPerson', using 'Postal' vote or 'NotVoting' at all?\n", "\n", "Next came the definition of the full model as a list of factors. In our model, the factors are taken from a bayesian network (see below). Therefore, each factor represents a (normalized) probability distribution for one of the 6 variables. Such variable is always the first element of 'vars' (see 'vars' in class Factor). If 'vars' has more than\n", "one element, then the factor represents a conditional distribution that is conditioned on the remaining variables.\n", "\n", "The following function 'evaluate_factor' can help us interpret the individual factors. Its arguments are a Factor instance and values for all the variables that the factor depents on (in the order given by 'vars'). The values are given only as integer IDs of the categories (i.e. second indices into 'val_names') The function evaluates the factor using the values. Finally, using the variable and value names, it prints out a string showing what (conditional) distribution is evaluated using what values and what is the resulting probability." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "P(Voting=Postal|Party=Democrats,COVID=Positive) = 0.9\n", "P(Voting=NotVoting|Party=Republicans,COVID=Positive) = 0.5\n", "P(Favorite=Biden|Party=Republicans) = 0.2\n", "P(State=California) = 0.5\n" ] } ], "source": [ "def evaluate_factor(factor, *values):\n", " var_vals = ['='.join((var_names[var], val_names[var][val])) for var,val in zip(factor.vars, values)]\n", " print(\"P(\"+var_vals[0]+ ('' if len(var_vals)<2 else '|')+(','.join(var_vals[1:]))+\") =\", factor.table[values])\n", "\n", "evaluate_factor(factors[5], 1, 0, 1)\n", "evaluate_factor(factors[5], 2, 1, 1)\n", "evaluate_factor(factors[4], 0, 1)\n", "evaluate_factor(factors[0], 1)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the examples of evaluated factors above, we can see that\n", "* 90% of COVID positive supporters of Democrats choose to vote using postal votes\n", "* 50% of COVID positive Republicans choose not to vote at all\n", "* 20% of Republicans find Biden to be more acceptable president\n", "* 50% of voters are from California\n", "\n", "Using all the factors, the following code construct graph showing the corresponding Bayesian Network. Each node is one variable. For each factor, we create edges into the first variable in 'vars' from all the remaining variables." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "%3\r\n", "\r\n", "\r\n", "State\r\n", "\r\n", "State\r\n", "\r\n", "\r\n", "Party\r\n", "\r\n", "Party\r\n", "\r\n", "\r\n", "State->Party\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "Favorite\r\n", "\r\n", "Favorite\r\n", "\r\n", "\r\n", "Party->Favorite\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "Voting\r\n", "\r\n", "Voting\r\n", "\r\n", "\r\n", "Party->Voting\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "Age\r\n", "\r\n", "Age\r\n", "\r\n", "\r\n", "Age->Party\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "COVID\r\n", "\r\n", "COVID\r\n", "\r\n", "\r\n", "COVID->Voting\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from graphviz import Digraph\n", "dot = Digraph()\n", "dot.edges([(var_names[v] ,var_names[f.vars[0]]) for f in factors for v in f.vars[1:]])\n", "dot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, our model naively assumes that the probability of being COVID positive does not dependent\n", "on the age or the state the person is from.\n", "\n", "Similarly, we can draw Factor Graph corresponding to our model:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "%3\r\n", "\r\n", "\r\n", "State_\r\n", "\r\n", "State\r\n", "\r\n", "\r\n", "State\r\n", "\r\n", "State\r\n", "\r\n", "\r\n", "State_--State\r\n", "\r\n", "\r\n", "\r\n", "Age_\r\n", "\r\n", "Age\r\n", "\r\n", "\r\n", "Age\r\n", "\r\n", "Age\r\n", "\r\n", "\r\n", "Age_--Age\r\n", "\r\n", "\r\n", "\r\n", "COVID_\r\n", "\r\n", "COVID\r\n", "\r\n", "\r\n", "COVID\r\n", "\r\n", "COVID\r\n", "\r\n", "\r\n", "COVID_--COVID\r\n", "\r\n", "\r\n", "\r\n", "Party_\r\n", "\r\n", "Party\r\n", "\r\n", "\r\n", "Party_--State\r\n", "\r\n", "\r\n", "\r\n", "Party_--Age\r\n", "\r\n", "\r\n", "\r\n", "Party\r\n", "\r\n", "Party\r\n", "\r\n", "\r\n", "Party_--Party\r\n", "\r\n", "\r\n", "\r\n", "Favorite_\r\n", "\r\n", "Favorite\r\n", "\r\n", "\r\n", "Favorite_--Party\r\n", "\r\n", "\r\n", "\r\n", "Favorite\r\n", "\r\n", "Favorite\r\n", "\r\n", "\r\n", "Favorite_--Favorite\r\n", "\r\n", "\r\n", "\r\n", "Voting_\r\n", "\r\n", "Voting\r\n", "\r\n", "\r\n", "Voting_--COVID\r\n", "\r\n", "\r\n", "\r\n", "Voting_--Party\r\n", "\r\n", "\r\n", "\r\n", "Voting\r\n", "\r\n", "Voting\r\n", "\r\n", "\r\n", "Voting_--Voting\r\n", "\r\n", "\r\n", "\r\n", "\r\n" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from graphviz import Graph\n", "fg = Graph()\n", "for f in factors:\n", " fg.node(var_names[f.vars[0]]+\"_\", var_names[f.vars[0]], shape=\"box\")\n", "fg.edges([(var_names[f.vars[0]]+\"_\" ,var_names[v]) for f in factors for v in f.vars])\n", "fg" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "As was already said, each factor corresponds to (conditional) distribution of one variable. Therefore, we choose the factors (rectangle) node names to be these variables names. This Factor Graph has tree structure so application of Belief Propagation should be straightforward.\n", "\n", "The following cell implements the inefficient 'brute-force marginalization' approach, which might be useful to test correctness of your Belief Propagation implementation. When calling this function, we can specify values of some of the variables and the function marginalizes over all possible values of the remaining variables. It implements the operation\n", " $$\n", "\\DeclareMathOperator{\\xx}{\\mathbf{x}}\n", "\\sum_{\\xx_{marg}}\\prod_s f_s(\\xx_s)\n", "$$\n", "where the product is over all factors $f_s$, $\\xx_s$ is the subset of variables that $f_s$ depends on and $\\xx_{marg}$ is the subset of variables for which the sum sums over all their possible values (i.e marginalizes). For our model where the factors are the (conditional) distributions of individual variables, the product corresponds to the joint distribution of all the variables\n", "$$\n", "\\prod_s f_s(\\xx_s) = P(State) P(Age) P(COVID) P(Party|State,Age) P(Favorite|Party) P(Voting|Party,COVID)\\\\\n", "= P(State,Age,COVID,Party,Favorite,Voting)\n", "$$\n", "\n", "For example, if we supply the function with values for $Party$ and $Voting$ and marginalize over all other variables, we obtain the joint marginal probability\n", "\n", "\n", "$$\n", "P(Party,Voting)=\\sum_{State} \\sum_{Age}\\sum_{COVID}\\sum_{Favorite}P(State,Age,COVID,Party,Favorite,Voting)\n", "$$" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Z = 0.9999999999999999\n" ] } ], "source": [ "import itertools \n", "def brute_force_marginalize(value_list):\n", " \"\"\"\n", " observed_values is a list of values one for each variable. For values set to None,\n", " we marginalize over all possible values of the corresponding variable. For other\n", " values, we use the given value for the corresponding variables when evaluating factors.\n", " \"\"\"\n", " value_ranges = [range(n) if v is None else (v,) for n,v in zip(ncategories, value_list)]\n", " marginal_prob = 0.0\n", " # itertools.product let us iterate over all possible values of all variables\n", " for values in itertools.product(*value_ranges):\n", " joint_prob = 1.0\n", " for f in factors:\n", " joint_prob *= f.table[tuple(values[v] for v in f.vars)]\n", " marginal_prob+=joint_prob\n", " return marginal_prob\n", "\n", "print(\"Z = \", brute_force_marginalize([None, None, None, None, None, None]))\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "When calling 'brute_force_marginalize' in the cell above, we set the values of all variables to 'None'. This instructs the function to sum over all the values of all the variables. The result is 1, which proves that our model represents a well normalized joint distribution. In the following calls, we clamp some of the variables to specific values, which avoids marginalization over those variables." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "P(Voting=InPerson) = 0.4061000000000001\n", "P(Voting=Postal) = 0.43120000000000014\n", "P(Voting=NotVoting) = 0.1627\n", "P(Party=Democrats) = 0.6699999999999997\n", "P(Party=Republicans)= 0.33000000000000024\n", "P(Party=Republicans,Voting=InPerson)= 0.17159999999999995\n" ] } ], "source": [ "print(\"P(Voting=InPerson) =\", brute_force_marginalize([None, None, None, None, None, 0]))\n", "print(\"P(Voting=Postal) =\", brute_force_marginalize([None, None, None, None, None, 1]))\n", "print(\"P(Voting=NotVoting) =\", brute_force_marginalize([None, None, None, None, None, 2]))\n", "print(\"P(Party=Democrats) =\", brute_force_marginalize([None, None, None, 0, None, None]))\n", "print(\"P(Party=Republicans)=\", brute_force_marginalize([None, None, None, 1, None, None]))\n", "print(\"P(Party=Republicans,Voting=InPerson)=\", brute_force_marginalize([None, None, None, 1, None, 0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above examples show that the marginal probabilities $P(Voting)$ and $P(Party)$ sum to 1 when summed over all their possible values. The marginal distribution $P(Voting)$ tells us what are the probabilities of different forms of voting (InPerson, Postal, NotVoting) and $P(Party=Democrats)$ says that $2/3$ of all the people are Democrats. \n", "\n", "We also evaluate the joint marginal P(Party,Voting) to see that out of all the people, 17.16% are Republicans that came to vote InPerson." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Tasks and Questions\n", "Your task is to implement the Sum-Product algorithm allowing for much more efficient inference that the brute-force approach. Based on the following questions, you will use the algorithm to infer individual probabilities such as\n", "$$P(Age=Young,Favorite=Biden)=?$$\n", "$$P(Voting=Postal|Favorite=Trump,COVID=Positive)=?$$\n", "or the whole marginal distributions such as\n", "$$P(Voting|Favorite=Trump,COVID=Positive)=?$$\n", "$$P(Age,Favorite)=?$$\n", "For the distributions, you need to report probabilities for all the possible values of the variables in question, e.g.\n", "$$P(Age=Young,Favorite=Biden)=?$$\n", "$$P(Age=Young,Favorite=Trump)=?$$\n", "$$P(Age=Old,Favorite=Biden)=?$$\n", "$$P(Age=Old,Favorite=Trump)=?$$\n", "\n", "Use your implementation to infer the following probabilities or distributions and to answer the following question:\n", "\n", "1. For all the individual variables, use the Sum-Product algorithm to (most efficiently) infer and print their marginal distributions $P(State)$, $P(Age)$, $P(COVID)$, $P(Party)$, $P(Favorite)$, $P(Voting)$.\n", "2. What is the computational complexity of the inference from the previous point compared to the brute-force marginalization? Extend the code for the inference algorithms (both brute-force and Sum-Product) to count the number of additions and multiplications carried out during the inference (consider only those from the equations describing the algorithm). Report these number. Note that, with the brute-force approach you need to call the brute_force_marginalize once for every value of every variable in order to evaluate all the marginal distributions. Keep reporting such computational complexity comparisons also for all the following inference problems.\n", "3. Use the Sum-Product algorithm to evaluate and print the following conditional marginal distributions\n", "$$P(Voting|Favorite=Trump)$$\n", "$$P(Voting|Favorite=Trump,COVID=Positive)$$\n", "$$P(Voting|Favorite=Trump,State=Texas,COVID=Positive)$$\n", "4. Use the Sum-Product algorithm to (most efficiently) evaluate and print the following joint marginal distribution\n", "$$P(Party, Favorite)$$\n", "Hint: Note that there is one factor dependent on both these variables.\n", "5. Use the Sum-Product algorithm to evaluate and print the following joint marginal probability\n", "$$P(Party=Republicans,Voting=InPerson)$$\n", "6. How can we use the Sum-Product algorithm to evaluate the whole joint marginal distribution $P(Party,Voting)$?\n", "Note that in this case, there is no factor dependent on both these variables.\n", "7. Implement Max-Product (or Max-Sum) algorithm and use it to infer the most likely values of all the variables. In other words, find the values for all the variables that maximizes the joint probability\n", "$$P(State, Age, COVID, Party, Favorite, Voting)$$\n", "8. With observed values $Voting=Postal$ and $Favourite=Trump$, .use Max-Product (or Max-Sum) algorithm find the most likely values of all the remaining variables. \n", "9. Finally, what if we replace factor `Factor([2], [0.3, 0.7])` in our model with factor `Factor([2, 0], [[0.9, 0.8, 0.5], [0.1, 0.2, 0.5]])`? What does it represent? How does the corresponding Bayesian Network change? How does it complicate the inference? It is enought if you just answer these questions. You can also try to implement inference for the marginals of the individual variables for this modified model. Succesfull implementation will be awarded by a bonus point." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 2 }