Introduction:

Multi-Armed Bandit (MAB) algorithms apply to a wide range of sequential decision-making problems. A simple instance of such a problem is dynamic pricing of products.

  • A user arrives on an e-commerce platform.
  • The platform offers a product on sale at a certain price.
  • The user either makes a purchase at that price or doesn’t.
  • The platform’s objective is to maximize the total profit from purchases of the product.

In a MAB set-up, you can consider all the different price points for the product as arms (or actions); let’s say there are $K$ of them. Every time a user arrives on the platform, a new round starts, and we observe $T$ such rounds in total. In each round, the algorithm selects a price and shows it to the user. In return, it observes a reward based on the user’s behaviour: if the user makes a purchase, the revenue becomes the reward; if the user doesn’t purchase, the reward is zero. The algorithm observes the conversion only for the price the product was offered at in that round, so it needs to try out different prices to see which one results in the maximum revenue. This presents us with the classical exploration-exploitation dilemma.
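To make this set-up concrete, here is a minimal sketch of such a pricing environment in Python. The price points and conversion probabilities below are made-up values for illustration; the algorithm only ever sees the return value of `pull`.

```python
import numpy as np

# Hypothetical pricing environment: each arm is a price point. A user converts
# with some probability (unknown to the algorithm) that decreases with price;
# the reward is the price if the user purchases and 0 otherwise.
rng = np.random.default_rng(0)

PRICES = [5.0, 7.5, 10.0, 12.5, 15.0]        # K = 5 arms (assumed price points)
CONV_PROB = [0.60, 0.45, 0.30, 0.18, 0.10]   # assumed, hidden from the algorithm

def pull(arm: int) -> float:
    """Offer PRICES[arm] to one user and return the observed revenue."""
    purchased = rng.random() < CONV_PROB[arm]
    return PRICES[arm] if purchased else 0.0
```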

With enough trials and the right balance between exploration and exploitation, the algorithm will converge to a solution that maximizes revenue.

Auxiliary Feedback: In addition to the observed rewards from selected arms, there might be other feedback signals that could be of use. In the above problem, if the user converts at the selected price, we also know they would have converted at any lower price in that round.

Contexts: Selecting a single fixed action might not be optimal in every situation. A user who is likely to purchase can be offered the product at a higher price than a user who would only purchase at lower prices. The right price also depends on the type of product. Including a context changes the algorithm’s objective: it should look at the context before selecting the arm.

Global Constraints: There might be conditions that need to be satisfied while selecting arms. In our example, there could be an upper limit on the number of times users can convert at the lowest price (a budget).

Structured Actions: The arms could have some structure: not just selecting the correct price for one product, but selecting prices for multiple products in a flash sale.

Stochastic Bandits:

Algorithm:

- for each round t = 1, …, T:
	- pick an arm a_t from the set of arms A
	- observe the reward r_t for the chosen arm a_t
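A minimal sketch of this interaction loop, assuming an environment function `pull` like the one in the pricing sketch above and a `policy` function (both hypothetical names) that maps the observed history to an arm index:

```python
from typing import Callable, List

def run_bandit(policy: Callable[[List[List[float]]], int],
               pull: Callable[[int], float],
               T: int, K: int) -> List[float]:
    """Run T rounds: in each round the policy picks an arm a_t from the K arms
    based on the rewards observed so far, and the environment returns the
    reward r_t for that arm only (bandit feedback)."""
    history: List[List[float]] = [[] for _ in range(K)]
    rewards: List[float] = []
    for t in range(T):
        a_t = policy(history)      # pick arm a_t from the set of arms A
        r_t = pull(a_t)            # observe reward r_t for the chosen arm a_t
        history[a_t].append(r_t)
        rewards.append(r_t)
    return rewards

# Example usage with the pricing sketch above (hypothetical):
#   uniform = lambda hist: int(np.random.default_rng().integers(len(hist)))
#   revenues = run_bandit(uniform, pull, T=10_000, K=len(PRICES))
```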

Regret: To understand the algorithm’s performance over time, we look at a metric called regret. We can measure how the algorithm is performing up to round $T$ by comparing it with a policy that would have selected the best action in every round.

$$R(T) = \mu^* \cdot T - \sum_{t=1}^{T}{\mu(a_t)}$$

where,

$\mu(a_t)$ is the mean reward of the arm $a_t$ selected in round $t$,

$\mu^* = \max_{a}{\mu(a)}$ is the mean reward of the best-performing arm.

Since $R(T)$ depends on the randomness in the algorithm and in the reward distributions, we take its expectation, called the expected regret $\mathbb{E}[R(T)]$.
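In a simulation, where the true mean rewards are known (even though the algorithm never sees them), $R(T)$ can be computed directly from this definition. A small sketch, reusing the assumed pricing environment where the mean reward of an arm is `PRICES[a] * CONV_PROB[a]`:

```python
import numpy as np

def regret(mu: np.ndarray, chosen_arms: np.ndarray) -> float:
    """R(T) = mu* . T - sum_t mu(a_t), where mu holds the true mean reward of
    each arm and chosen_arms lists the arm played in each round."""
    mu_star = mu.max()
    T = len(chosen_arms)
    return mu_star * T - mu[chosen_arms].sum()

# Example with the assumed pricing environment:
#   mu = np.array(PRICES) * np.array(CONV_PROB)
#   regret(mu, np.array(chosen_arms))
```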

Uniform Exploration First:

- for the first N rounds (exploration phase):
	- pick an arm a_t from the set of arms A uniformly at random
	- observe the reward r_t for the chosen arm a_t
- for the remaining T − N rounds (exploitation phase):
	- select the arm with the highest average reward observed during exploration
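A runnable sketch of this explore-first procedure, again assuming the `pull` environment from the earlier pricing sketch:

```python
import numpy as np

def explore_first(pull, K: int, T: int, N: int, seed: int = 0):
    """Pick arms uniformly at random for the first N rounds, then commit to
    the arm with the highest average observed reward for the remaining rounds."""
    rng = np.random.default_rng(seed)
    sums, counts = np.zeros(K), np.zeros(K)
    chosen, rewards = [], []

    # Exploration phase: uniform random arm selection.
    for t in range(N):
        a_t = int(rng.integers(K))
        r_t = pull(a_t)
        sums[a_t] += r_t
        counts[a_t] += 1
        chosen.append(a_t)
        rewards.append(r_t)

    # Exploitation phase: commit to the empirically best arm.
    best = int(np.argmax(sums / np.maximum(counts, 1)))
    for t in range(N, T):
        chosen.append(best)
        rewards.append(pull(best))
    return chosen, rewards
```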

TODO: regret analysis