Dark Worlds Challenge

Posted on February 08, 2018

Observing Dark Worlds was a Kaggle competition from 2012 the goal of which was to develop algorithm for predicting location of dark matter. To detect dark matter aggregations in deep space, astrophysicists inspect the change in visual appearance of background galaxies. Dark matter distorts spacetime and causes light to bend when passing through forming the so-called halos. As a result, background galaxies appear more elongated than they actually are, an effect known as gravitational lensing. The observed ellipticity of galaxies correlates with the position of dark matter halos and makes it possible to estimate these positions.

See A Beginners Guide to Dark Matter by David Harvey for more details on the data and the problem of identifying dark matter halos.

Figure 1: Example of lensing effect due to the presence of dark matter.

The organizers of the competition provide a training database with 300 sky maps, Training_Sky1 to Training_Sky300. Each map contains coordinates of galaxies in the sky map, variables x and y, along with their observed ellipticity, variables e1 and e2. There are between 300 and 700 galaxies in each training map. The file Training_halos.csv provides the true halo locations in each map. There may be up to 3 halos in a map. A separate set of 120 test maps, Test_Sky1 to Test_Sky120, is used for evaluation of prediction performance.

In this post I discuss a Bayesian approach to the problem but I don't provide a complete solution. My goal is to show how one can leverage the label information in the training data to fine-tune the hyperparameters in a posterior model which can then be applied to test data. The approach can be generalized to other supervised learning problems.

Likelihood model

I follow the guidelines suggested here to formulate the likelihood function for the observed galaxy ellipticity.

The effect of a dark-matter halo positioned at \((x', y')\) on a galaxy located at \((x, y)\) is given by a pair of tangential ellipticity coefficients \((te_1, te_2)\) according to

\begin{align} \begin{split} te_1 = -\cos(\phi) \\
te_2 = -\sin(\phi) \\
\phi = 2\arctan(\frac{y-y'}{x-x'}) \end{split} \end{align}

From the tangential ellipticity we obtain the observed ellipticity \((e_1, e_2)\) which is proportional to the mass of the halo \(m\) and inversely proportional to the Euclidean distance \(\rho\) between the galaxy and the halo

\begin{align} \begin{split} e_1 = -m\cos(\phi)/\rho \\
e_2 = -m\sin(\phi)/\rho \\
\end{split} \end{align}

The halo mass m is unknown and has to be included as a model parameter. I decide to model it in the log-scale by a parameter lm, thus \(m = \exp(lm)\). That will give me more flexibility in specifying a prior for the mass. Let \(r = 1/\rho\) be the inverse distance between the galaxy and the halo. As noticed by some of the winners in the competition, Tim Salimans and others, it is beneficial to put some lower limit for the distance \(\rho\). This prevents \(r\) from blowing up when the distance gets close to 0 and makes the model more stable. Instead of using the trial and error approach to find an optimal limit, I include it in the model and let it be estimated using the training data. Let the hyperparameter \(rmax\) be the upper limit for \(r\). The ellipticity equations then become

\begin{align} \begin{split} e_1 = -\exp(lm)\cos(\phi)\min(rmax, r) \\
e_2 = -\exp(lm)\sin(\phi)\min(rmax, r) \\
\end{split} \end{align}

It is reasonable to assume that the observed ellipticity include some measurement error modeled by a Gaussian noise with variance given by a parameter \(sig2\). I can then write the likelihood of my model

\begin{align} \begin{split} e_1 \sim \mathcal{N}(-\exp(lm)\cos(\phi)\min(rmax, r), sig2) \\
e_2 \sim \mathcal{N}(-\exp(lm)\sin(\phi)\min(rmax, r), sig2) \end{split} \end{align} The errors for \(e_1\) and \(e_2\) are assumed uncorrelated.

Next, I show the likelihood specification using the bayesmh command in Stata


bayesmh (e1=(exp({lm})*{cos2t:}*{r:})) (e2=(exp({lm})*{sin2t:}*{r:})),  
	likelihood(mvnormal({sig2}))                                    
	define(dx: x-{halox}) define(dy: y-{haloy})                     
	define(cos2t: ({dy:}^2-{dx:}^2)/({dy:}^2+{dx:}^2))              
	define(sin2t:-2*{dx:}*{dy:}/({dy:}^2+{dx:}^2))                  
	define(r: min({rmax}, 1/sqrt({dy:}^2+{dx:}^2)))                
        ...

There are 5 model parameters: {halox} and {haloy}, the coordinates of the unknown halo, {rmax}, {lm}, and {sig2}. I am using {dx:} and {dy:} as placeholders for the coordinate differences between the galaxies and the halo, {cos2t:} and {sin2t:} for the tangential ellipticity coefficients \(te_1\) and \(te_2\), and {r:} for the minimum of the inverse distance \(r\) and \(rmax\).

Supervised and unsupervised Bayesian models

My approach to Bayesian supervised learning goes through two Bayesian models. The first model utilizes the true halo coordinates available in the training data to elicit information about the parameters {rmax}, {lm} and {sig2}. In the second model, I use this information to build a unsupervised model where the halo coordinates are assumed unknown and will be estimated from the observed galaxy ellipticity. Of course, only the second model is applicable to test data.

In the supervised model, I assume that the true halo coordinates {halox} and {haloy} are normally distributed with means halox and haloy, the estimated coordinates available in the training data, and variance {sig2}:

  ...
  prior({halox}{haloy}, mvnormal(2, halox, haloy, {sig2}))
  ...

Note that the variance {sig2} is the same one used in the likelihood function. This choice is based on the assumption that the halo and ellipticity estimations are subject to the same observation error, which is not unreasonable. Lets apply a vague inverse-gamma prior for the variance {sig2}: igamma(0.1, 0.1).

Given the dimensions of each map, a local distance of 10 should be considered small. I therefore expect {rmax} to be less than 0.1 and I assign it a uniform(0.001, 0.1) prior. Finally, I assume that the halo's log-mass follows a normal prior with mean 1 and variance 10. These all seem to be innocuous assumptions. Here it is the complete model specification using the bayesmh command

bayesmh (e1=(exp({lm})*{cos2t:}*{r:})) (e2=(exp({lm})*{sin2t:}*{r:})),  
	likelihood(mvnormal({sig2}))                                     
	define(dx: x-{halox}) define(dy: y-{haloy})                       
	define(cos2t: ({dy:}^2-{dx:}^2)/({dy:}^2+{dx:}^2))                
	define(sin2t:-2*{dx:}*{dy:}/({dy:}^2+{dx:}^2))                    
	define(r: min({rmax}, 1/sqrt({dy:}^2+{dx:}^2)))                   
	prior({rmax},  uniform(0.001, 1))                                 
	prior({lm},    normal(1, 10))                                     
	prior({halox}{haloy}, mvnormal(2,                                 
                              `=mhalo1[1,2]',`=mhalo1[1,3]', {sig2}))     
	prior({sig2}, igamma(0.1, 0.1))

In the second model, I assume that the unknown halo coordinates {halox} and {haloy} follow a completely uninformative prior: uniform(0, 4200), where (0, 4200) is the observed coordinate range in the sky maps.

Finding the halo in the first sky

Lets apply the supervised Bayesian model formulated above to the first training sky map. I first need to load the file containing the known halo coordinates

. import delimited data/Training_halos.csv
. summarize

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       skyid |          0
 numberhalos |        300           2    .8178608          1          3
       x_ref |        300    2148.174    1088.439      27.72     4169.9
       y_ref |        300     2207.16    1106.727        .62    4197.65
     halo_x1 |        300     2153.49    1189.761      10.24    4188.56
-------------+---------------------------------------------------------
     halo_y1 |        300    2220.473    1214.173        .62    4197.65
     halo_x2 |        300    1425.036    1397.369          0       4166
     halo_y2 |        300     1414.77    1419.384          0    4162.19
     halo_x3 |        300    678.4122     1211.98          0    4178.54
     halo_y3 |        300    651.0135    1190.116          0    4185.87

To simplify the case, I only consider maps with 1 halo, so I create a matrix mhalo1 containing the indices and halo coordinates for the 1-halo sky maps

. set seed 12345
. gen skyid1 = _n*(numberhalos==1)
. mkmat skyid1 halo_x1 halo_y1 if skyid1>0, mat(mhalo1)

The halo coordinates in the first sky map, about 1087 and 1115, are given by the matrix entries mhalo1[1,2] and mhalo1[1,3]. I shall supply them to bayesmh using the local macros `=mhalo1[1,2]' and `=mhalo1[1,3]'.

Lets load the dataset of the first sky map which I know has 1 halo only

. import delimited data/sky/Training_Sky1.csv

and simulate the first Bayesian model. In addition to the posterior options, I separate the parameters of interest, {rmax}, {lm}, and {sig2}, into individual blocks (expecting them to be fairly independent) and supply valid initial options. I furthermore request a MCMC sample of size 5000.

. set seed 1
. bayesmh (e1=(exp({lm})*{cos2t:}*{r:})) (e2=(exp({lm})*{sin2t:}*{r:})),  ///
	likelihood(mvnormal({sig2}))                                      ///
	define(dx: x-{halox}) define(dy: y-{haloy})                       ///
	define(cos2t: ({dy:}^2-{dx:}^2)/({dy:}^2+{dx:}^2))                ///
	define(sin2t:-2*{dx:}*{dy:}/({dy:}^2+{dx:}^2))                    ///
	define(r: min({rmax}, 1/sqrt({dy:}^2+{dx:}^2)))                   ///
	prior({rmax},  uniform(0.001, 1))                                 ///
	prior({lm},    normal(1, 10))                                     ///
	prior({halox}{haloy}, mvnormal(2,                                 ///
                              `=mhalo1[1,2]',`=mhalo1[1,3]', {sig2}))     ///
	prior({sig2}, igamma(0.1, 0.1))                                   ///
	block({lm}) block({rmax}) block({sig2})                           ///
	init({lm} 2 {rmax} 0.1 {sig2} 1)                                  ///
	mcmcs(5000) dots
  
Burn-in 2500 aaaaaaaaa1000aaaaaaaaa2000aaaaa done
Simulation 5000 .........1000.........2000.........3000.........4000.........50
> 00 done

Model summary
------------------------------------------------------------------------------
Likelihood: 
  e1 e2 ~ mvnormal(2,exp({lm})*xb_cos2t*xb_r,exp({lm})*xb_sin2t*xb_r,{sig2})

Priors: 
  {halox haloy} ~ mvnormal(2,1086.800048828125,1114.609985351563,{sig2})
         {rmax} ~ uniform(0.001,0.1)
         {sig2} ~ igamma(0.1,0.1)
           {lm} ~ normal(1,10)
------------------------------------------------------------------------------

Bayesian multivariate normal regression          MCMC iterations  =      7,500
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500
                                                 MCMC sample size =      5,000
                                                 Number of obs    =        348
                                                 Acceptance rate  =      .3927
                                                 Efficiency:  min =      .1143
                                                              avg =       .157
Log marginal likelihood =  85.417779                          max =      .1963
 
------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
       halox |  1086.806   .2058405   .008609   1086.804   1086.383   1087.201
       haloy |  1114.601   .2135336   .008191   1114.597   1114.172   1115.018
        rmax |  .0530633   .0280319   .000905   .0537841   .0064036   .0978673
          lm |   3.84977   .2919037   .010786   3.894357   3.139424   4.279701
        sig2 |  .0450231   .0024271   .000077   .0449918   .0404852   .0501556
------------------------------------------------------------------------------

Unsurprisingly, the posterior mean estimates for {halox} and {haloy} match the true halo coordinates due to the prior we specified. My interest however is in the other 3 parameters. Posterior means and standard deviations of {rmax}, {lm}, and {sig2} provide us with important information about these parameters which is relevant for all other sky maps. For example, we see that the variance parameter {sig2} is about 0.045 and this is expected to be true for all other training and testing sky maps.

To formulate the second, unsupervised, Bayesian model, I use the posterior mean estimates for {rmax}, {lm}, and {sig2} from the first model to specify more informative priors. Lets save the posterior summaries of the parameters of interest in a matrix msum

. bayesstats summ {rmax} {lm} {sig2}
. mat msum = r(summary)

Now, the matrix entries msum[1,1] and msum[1,2] contain the posterior mean and standard deviations of {rmax}, which I use to specify an informative prior for {rmax}:

 normal(`=msum[1,1]', `=msum[1,2]^2'))

note that the standard deviation is squared. The entries msum[2,1] and msum[2,2] contain the posterior mean and standard deviations for {lm}, which I use to specify an informative prior for {lm}:

 normal(`=msum[2,1]', `=msum[2,2]^2'))

I also use the posterior mean of {sig2}, msum[3,1], to replace the variance parameter in the likelihood of the first model. As I mentioned above, the halo coordinates {halox} and {haloy} now have completely uninformative priors uniform(0, 4200).

What follows is the full specification of the second Bayesian model along with the simulation results.

. set seed 1
. bayesmh (e1=(exp({lm})*{cos2t:}*{r:})) (e2=(exp({lm})*{sin2t:}*{r:})), ///
	likelihood(mvnormal(`=msum[3,1]'))                               ///
	define(dx: x-{halox}) define(dy: y-{haloy})                      ///
	define(r: min({rmax}, 1/sqrt({dy:}^2+{dx:}^2)))                  ///
	define(cos2t: ({dy:}^2-{dx:}^2)/({dy:}^2+{dx:}^2))               ///
	define(sin2t:-2*{dx:}*{dy:}/({dy:}^2+{dx:}^2))                   ///
	prior({rmax}, normal(`=msum[1,1]', `=msum[1,2]^2'))              ///
	prior({lm},   normal(`=msum[2,1]', `=msum[2,2]^2'))              ///
	prior({halox}, uniform(0, 4200))                                 ///
	prior({haloy}, uniform(0, 4200))                                 ///
	init({lm} 1) block({lm}) mcmcs(5000) dots

Burn-in 2500 aaaaaaaaa1000aaaaaaaaa2000aaaaa done
Simulation 5000 .........1000.........2000.........3000.........4000.........50
> 00 done

Model summary
------------------------------------------------------------------------------
Likelihood: 
  e1 e2 ~ mvnormal(2,expr6,expr7,.0450230840619126)

Priors: 
  {halox haloy} ~ uniform(0,4200)
         {rmax} ~ normal(.0530632648891123,.0007857890494254)
           {lm} ~ normal(3.849770050209967,.0852077452213579)

Expressions: 
  expr6 : exp({lm})*xb_cos2t*xb_r
  expr7 : exp({lm})*xb_sin2t*xb_r
------------------------------------------------------------------------------

Bayesian multivariate normal regression          MCMC iterations  =      7,500
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500
                                                 MCMC sample size =      5,000
                                                 Number of obs    =        348
                                                 Acceptance rate  =      .3793
                                                 Efficiency:  min =      .1038
                                                              avg =      .1466
Log marginal likelihood =  88.063805                          max =       .208
 
------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
       halox |  1046.183   43.07794   1.84272   1049.074   944.8554   1123.163
       haloy |  1090.764   64.73769   2.84159   1092.794   952.1675   1202.468
        rmax |  .0559463   .0252325   .000783   .0552896    .010796   .1063793
          lm |  3.880156   .2010585   .006996   3.896191   3.461298   4.233885
------------------------------------------------------------------------------

The posterior mean estimates for the halo coordinates in the first sky map, about (1046, 1091), are very close to the true halo coordinates (1087, 1115). Of course, this is not so impressive given that we have used information from the first sky map to tune the priors for second model. Lets verify the validity of the model on independent data - the second training sky map.

. import delimited data/sky/Training_Sky2.csv

The second sky map also have only 1 halo and I know its true coordinates, (3478, 1907).

Lets run the same Bayesian model as above

. set seed 2
. bayesmh (e1=(exp({lm})*{cos2t:}*{r:})) (e2=(exp({lm})*{sin2t:}*{r:})), ///
	likelihood(mvnormal(`=msum[3,1]'))                               ///
	define(dx: x-{halox}) define(dy: y-{haloy})                      ///
	define(r: min({rmax}, 1/sqrt({dy:}^2+{dx:}^2)))                  ///
	define(cos2t: ({dy:}^2-{dx:}^2)/({dy:}^2+{dx:}^2))               ///
	define(sin2t:-2*{dx:}*{dy:}/({dy:}^2+{dx:}^2))                   ///
	prior({rmax}, normal(`=msum[1,1]', `=msum[1,2]^2'))              ///
	prior({lm},   normal(`=msum[2,1]', `=msum[2,2]^2'))              ///
	prior({halox}, uniform(0, 4200))                                 ///
	prior({haloy}, uniform(0, 4200))                                 ///
	init({lm} 1) block({lm}) mcmcs(5000) dots

Burn-in 2500 aaaaaaaaa1000aaaaaaaaa2000aaaaa done
Simulation 5000 .........1000.........2000.........3000.........4000.........50
> 00 done

Model summary
------------------------------------------------------------------------------
Likelihood: 
  e1 e2 ~ mvnormal(2,expr6,expr7,.0450230840619126)

Priors: 
  {halox haloy} ~ uniform(0,4200)
         {rmax} ~ normal(.0530632648891123,.0007857890494254)
           {lm} ~ normal(3.849770050209967,.0852077452213579)

Expressions: 
  expr6 : exp({lm})*xb_cos2t*xb_r
  expr7 : exp({lm})*xb_sin2t*xb_r
------------------------------------------------------------------------------

Bayesian multivariate normal regression          MCMC iterations  =      7,500
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500
                                                 MCMC sample size =      5,000
                                                 Number of obs    =        589
                                                 Acceptance rate  =      .3812
                                                 Efficiency:  min =      .1069
                                                              avg =      .1661
Log marginal likelihood =  147.49849                          max =      .2201
 
------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
       halox |  3457.543   27.79345   1.11984   3456.117   3406.868   3512.155
       haloy |  1886.223   24.59853    1.0639   1886.248   1840.108   1932.195
        rmax |  .0558301   .0255934   .000771   .0544789   .0119201    .107901
          lm |  3.873674   .1668635   .005101   3.882131   3.518733    4.17672
------------------------------------------------------------------------------
Note: Adaptation tolerance is not met in at least one of the blocks.

The posterior mean estimates for the halo coordinates in the second sky map are about (3458, 1886). These are indeed close to the true halo coordinates (3478, 1907) and this is now a satisfying verification of our model.