## A Python implementation of selecting a sample with maximal coverage of a population while minimising cost.

Some 18,000,000,000 km away, Voyager 2 left the heliosphere, the warm afterglow of the solar wind from our Sun, and ventured onwards towards interstellar space.

Since launching on the 20th of August 1977, the spacecraft has enjoyed an expansive panorama of our solar system, observing Jupiter in 1979, Saturn in 1981 and Uranus in early 1986.

Given the vastness of space, this raises a question: if you have a limited number of selection slots available, how can you determine the best sample of a population?

In this case, I take the “best” to be a selection’s relative uniqueness against the entire population. For instance, how unique is Mars amongst all the other planets in our solar system? But reaching Mars requires a compromise, a rejection of alternatives, say the set of Mercury and Venus.

In short, we want to **find the set of N items that are the most insightful about our population while minimising the cost in reaching them.**

Using Python 3, we first want to calculate the distinctiveness of each item. We can imagine a dictionary describing each planet:

```python
# each key maps to a list of values representing a planet's properties:
# index 0 = radius in km, index 1 = mass in kg (exponent stripped for
# the example), index 2 = human population, index 3 = O2 concentration
# in the atmosphere in percent, index 4 = gravitational acceleration in m/s^2
planets = {}

planets["Mercury"] = [2440, 3.285, 0, 0.13, 3.59]
planets["Venus"] = [6052, 4.867, 0, 0, 8.87]
planets["Earth"] = [6371, 5.972, 7530000000, 20.95, 9.81]
planets["Mars"] = [3390, 6.39, 0, 0.146, 3.77]
planets["Jupiter"] = [69911, 1.898, 0, 0, 25.95]
planets["Saturn"] = [58232, 5.683, 0, 0, 11.08]
planets["Uranus"] = [25362, 8.681, 0, 0, 10.67]
planets["Neptune"] = [24622, 1.024, 0, 0, 14.07]
```

What we want is a vector per planet, so that we can use cosine similarity to determine the angular displacement of each planet from every other. Cosine similarity is heavily utilised in information retrieval to determine the relevance of a document against a set of documents. Visually, we want a representation of our planets where we can simply measure the angles between them.

Determining the cosine similarity between each pair of vectors is achieved with:

```python
from numpy import dot
from numpy.linalg import norm

planetList = [*planets]

# determine the cosine similarity of each planet against every other
results = {}
for planet in planetList:
    cosineResults = {}
    for altPlanet in planetList:
        if altPlanet != planet:
            cosineResult = dot(planets[planet], planets[altPlanet]) / (
                norm(planets[planet]) * norm(planets[altPlanet]))
            # print(planet + " VS " + altPlanet + " === " + str(cosineResult))
            cosineResults[altPlanet] = cosineResult
    results[planet] = cosineResults
```

Each comparison yields a value between −1 and 1, the cosine of the angle between the two vectors, such as:

0.35869123

So we can loosely interpret this as saying Earth is approximately 36% similar to Mars. (Note that with raw features on wildly different scales, such as Earth’s population count, a single dimension can dominate the similarity, so scaling the features first is worth considering.)

Iterating through our planets and archiving the `cosineResult` between `Planet N` and `Planet N+1`, we’re able to develop a matrix of relative distinction scores amongst our items.

Next, we need to calculate the cost of the relationship between our planets. That is, to understand the cost of travelling from `Planet N` to `Planet N+1`.

Naturally, this is a multivariable problem involving a range of facets such as distance, time, potential hazards, the impact of solar radiation and so on. To keep the focus on the idea of measuring reward against cost, we’ll just consider the distance between planets. That is, the cost of “Earth” to “Mars” is the distance between them, which is, on average, 225 million km. To keep our scores at a manageable scale, we’ll drop the “million” so our final cost is just `225`.

Computing our score for the trip from Earth to Mars is simply reward divided by cost:

```python
score = 0.35869123 / 225
# => 0.00159418324
```
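Putting those two steps together, a minimal sketch of the scoring helpers might look like the following. The `distances_from_earth` table is purely illustrative (an approximate average, not an ephemeris), and the helper names are my own:

```python
import numpy as np

# hypothetical average distances in millions of km -- illustrative only
distances_from_earth = {"Mars": 225}

def cosine_similarity(a, b):
    # cosine of the angle between two feature vectors
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score(similarity, cost):
    # reward per unit of cost: higher is better
    return similarity / cost

print(score(0.35869123, distances_from_earth["Mars"]))  # ≈ 0.00159418324
```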

We now reach an interesting inflexion point, where our model for optimising can vary with respect to the amount of domain knowledge and historical knowledge available. In some cases there may be data from historic trips (Voyager 1) that we can exploit to understand the latent benefits and perils of venturing to a particular planet.

If such a scenario were present, we could explore our options with Monte Carlo Tree Search or an even more rudimentary model like multiple linear regression.

In this case, we don’t have any such data, so we can only consider the relative scores available to us. As such, I think the best way to assess the optimal sample is to sort our scores using the following structure:

```python
# create a list of paired planets and their scores
scoreRanks = [[["planetA", "planetB"], score_of_planetA_to_planetB], ...]
```

Sorting in descending order, so the best reward/cost pairs come first, leaves us with:

```python
from operator import itemgetter

scoreRanks = sorted(scoreRanks, key=itemgetter(1), reverse=True)
```
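As a quick sanity check on the sorting step, here is a toy `scoreRanks` list (the score values are made up purely for illustration):

```python
from operator import itemgetter

# hypothetical reward/cost scores -- values invented for illustration
scoreRanks = [
    [["Earth", "Mars"], 0.00159],
    [["Venus", "Mercury"], 0.00210],
    [["Saturn", "Jupiter"], 0.00047],
]

scoreRanks = sorted(scoreRanks, key=itemgetter(1), reverse=True)
print(scoreRanks[0][0])  # best-scoring pair first: ['Venus', 'Mercury']
```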

Lastly, say we want to retrieve the 3 planets that represent the best combination for maximising variance while minimising cost. We can achieve this by recursively iterating through our sorted `scoreRanks`.

```python
def planetCombination(ranks, length, result):
    # recursively build the optimal combination, returning once the
    # number of distinct hits determined by length is reached
    if len(set(result)) == length:
        return result
    if len(result) == 0:
        # seed with the highest-scoring pair
        result.append(ranks[0][0][0])
        result.append(ranks[0][0][1])
    else:
        # retrieve the highest-scoring partner of the last item
        # that isn't already in result
        lastItem = result[-1]
        for entry in ranks:
            if entry[0][0] == lastItem and entry[0][1] not in result:
                result.append(entry[0][1])
                break
        else:
            return result  # no unused partner found; stop early
    return planetCombination(ranks, length, result)
```

Our final output will be a list of three planets (our requirement of three is a fixed input in this example, passed as the parameter `length`) that satisfies our base case.
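Tying everything together, here is a self-contained sketch of the whole pipeline. Since the article only gives the Earth-to-Mars distance, I use the absolute difference in approximate mean orbital radius (millions of km) as a stand-in for pairwise travel cost; that proxy, and the greedy walk itself, are my assumptions rather than anything definitive:

```python
from operator import itemgetter
import numpy as np

planets = {
    "Mercury": [2440, 3.285, 0, 0.13, 3.59],
    "Venus": [6052, 4.867, 0, 0, 8.87],
    "Earth": [6371, 5.972, 7530000000, 20.95, 9.81],
    "Mars": [3390, 6.39, 0, 0.146, 3.77],
    "Jupiter": [69911, 1.898, 0, 0, 25.95],
    "Saturn": [58232, 5.683, 0, 0, 11.08],
    "Uranus": [25362, 8.681, 0, 0, 10.67],
    "Neptune": [24622, 1.024, 0, 0, 14.07],
}

# approximate mean distance from the Sun in millions of km; the pairwise
# cost below is simply the difference between two of these, a crude proxy
orbit = {"Mercury": 58, "Venus": 108, "Earth": 150, "Mars": 228,
         "Jupiter": 778, "Saturn": 1434, "Uranus": 2871, "Neptune": 4495}

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# score every unordered pair as reward/cost
names = list(planets)
scoreRanks = []
for i, a in enumerate(names):
    for b in names[i + 1:]:
        reward = cosine_similarity(planets[a], planets[b])
        cost = abs(orbit[a] - orbit[b])
        scoreRanks.append([[a, b], reward / cost])
scoreRanks.sort(key=itemgetter(1), reverse=True)

def planet_combination(ranks, length):
    # greedy walk: seed with the best pair, then keep extending from the
    # most recently chosen planet via its best-scoring unused partner
    result = list(ranks[0][0])
    while len(result) < length:
        for pair, _ in ranks:
            if result[-1] in pair:
                other = pair[1] if pair[0] == result[-1] else pair[0]
                if other not in result:
                    result.append(other)
                    break
        else:
            break  # no further extension possible
    return result[:length]

print(planet_combination(scoreRanks, 3))
```

With these assumed costs, the neighbouring inner planets score highest (near-identical direction in feature space at low cost), so the greedy walk returns `['Mercury', 'Venus', 'Mars']`; Earth is penalised because its enormous population value swings its vector away from every other planet's.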

Github link is here.

The above is an unoptimised solution. If you can think of any superior alternatives or suggestions, please let me know!

Thanks!