GPy.util package¶

Introduction¶

A variety of utility functions including matrix operations and quick access to test datasets.

Submodules¶

GPy.util.block_matrices module¶

block_dot(A, B, diagonal=False)[source]¶

Element-wise dot product on block matrices:

    +-----+-----+     +-----+-----+     +---------+---------+
    | A11 | A12 |     | B11 | B12 |     | A11.B11 | A12.B12 |
    +-----+-----+  o  +-----+-----+  =  +---------+---------+
    | A21 | A22 |     | B21 | B22 |     | A21.B21 | A22.B22 |
    +-----+-----+     +-----+-----+     +---------+---------+

Note

If any block of either A or B is stored as a 1-D vector, it is assumed to denote a diagonal matrix, and a more efficient dot product using NumPy broadcasting is used, i.e. A11*B11.
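The diagonal shortcut described in the note can be illustrated with plain NumPy. This is a minimal sketch, not GPy's implementation: the helper name `block_dot_sketch` and the nested-list layout of the blocks are assumptions made for illustration.

```python
import numpy as np


def block_dot_sketch(A, B):
    """Block-by-block product of two block matrices given as nested lists.

    A 1-D block is read as the diagonal of a diagonal matrix, so the product
    reduces to a broadcast multiplication instead of a full matrix product.
    (Hypothetical helper; the block layout is an assumption.)
    """
    out = []
    for row_a, row_b in zip(A, B):
        out_row = []
        for a, b in zip(row_a, row_b):
            if a.ndim == 1 and b.ndim == 1:
                out_row.append(a * b)           # diag(a) @ diag(b), kept as a vector
            elif a.ndim == 1:
                out_row.append(a[:, None] * b)  # diag(a) @ b scales the rows of b
            elif b.ndim == 1:
                out_row.append(a * b[None, :])  # a @ diag(b) scales the columns of a
            else:
                out_row.append(a @ b)           # ordinary dense product
        out.append(out_row)
    return out


rng = np.random.default_rng(0)
dense = rng.standard_normal((3, 3))
diag_vec = np.array([1.0, 2.0, 3.0])

A = [[dense, diag_vec], [diag_vec, dense]]
B = [[diag_vec, dense], [diag_vec, dense]]
C = block_dot_sketch(A, B)
```

Each entry of `C` matches the product obtained by first expanding the 1-D blocks with `np.diag`, but without ever building the dense diagonal matrices.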

get_block_shapes(B)[source]¶

get_block_shapes_3d(B)[source]¶

get_blocks(A, blocksizes)[source]¶

get_blocks_3d(A, blocksizes, pagesizes=None)[source]¶: Given a 3-D array, make a block matrix in which the first and second dimensions are blocked according to blocksizes, and the pages are blocked using pagesizes.

unblock(B)[source]¶

GPy.util.choleskies module¶

backprop_gradient(dL, L)¶

Given the derivative of an objective function with respect to the Cholesky factor L, compute the derivative with respect to the original matrix K, defined as

K = LL^T

where L was obtained by Cholesky decomposition.
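For reference, the mapping from a gradient with respect to L to a gradient with respect to K can be written in a few lines of NumPy. The sketch below uses the standard closed-form reverse-mode expression and returns a symmetric gradient; it is not GPy's routine, which may use a different but equivalent convention, and the names `phi` and `chol_backprop_sketch` are hypothetical.

```python
import numpy as np


def phi(A):
    """Lower triangle of A with its diagonal halved."""
    out = np.tril(A)
    out[np.diag_indices_from(out)] *= 0.5
    return out


def chol_backprop_sketch(dL, L):
    """Map dF/dL to a symmetric dF/dK, where K = L @ L.T (hypothetical helper)."""
    P = phi(L.T @ np.tril(dL))
    S = 0.5 * (P + P.T)
    Y = np.linalg.solve(L.T, S)          # L^{-T} S
    return np.linalg.solve(L.T, Y.T).T   # (L^{-T} S) L^{-1}


rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
K = M @ M.T + 4.0 * np.eye(4)            # a well-conditioned SPD matrix
L = np.linalg.cholesky(K)
dL = np.tril(rng.standard_normal((4, 4)))
dK = chol_backprop_sketch(dL, L)

# Finite-difference check along a random symmetric direction E.
E = rng.standard_normal((4, 4))
E = E + E.T
eps = 1e-6


def objective(K_):
    # A linear objective of the Cholesky factor whose gradient w.r.t. L is dL.
    return np.sum(dL * np.linalg.cholesky(K_))


numeric = (objective(K + eps * E) - objective(K - eps * E)) / (2 * eps)
analytic = np.sum(dK * E)
```

The finite-difference value `numeric` and the analytic directional derivative `analytic` should agree closely, which is a convenient sanity check for any implementation of this mapping.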

flat_to_triang(flat_mat)¶

indexes_to_fix_for_low_rank(rank, size)[source]¶: Work out which indexes of the flattened array should be fixed if we want the Cholesky factor to represent a low-rank matrix.

multiple_dpotri(Ls)[source]¶

safe_root(N)[source]¶

triang_to_cov(L)[source]¶

triang_to_flat(L)¶

GPy.util.choleskies_cython module¶

GPy.util.classification module¶

conf_matrix(p, labels, names=['1', '0'], threshold=0.5, show=True)[source]¶

Returns the error rate and the true/false positives in a binary classification problem.

- Actual classes are displayed by column.
- Predicted classes are displayed by row.

Parameters:
    p – array of class ‘1’ probabilities.
    labels – array of actual classes.
    names – list of class names, defaults to ['1', '0'].
    threshold – probability value used to decide the class.
    show (False|True) – whether the matrix should be shown or not.
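The thresholding and counting that such a function performs can be sketched as follows. This is an illustration only, not GPy's code: the helper name `conf_matrix_sketch`, the 1/0 label coding, the strict `>` comparison and the order of the returned values are all assumptions.

```python
import numpy as np


def conf_matrix_sketch(p, labels, threshold=0.5):
    """Error rate and confusion counts for a binary problem (hypothetical helper)."""
    p = np.asarray(p, dtype=float).ravel()
    labels = np.asarray(labels).ravel().astype(bool)  # assumes labels coded as 1/0
    predicted = p > threshold                         # predicted class '1'

    tp = int(np.sum(predicted & labels))    # predicted 1, actually 1
    fp = int(np.sum(predicted & ~labels))   # predicted 1, actually 0
    fn = int(np.sum(~predicted & labels))   # predicted 0, actually 1
    tn = int(np.sum(~predicted & ~labels))  # predicted 0, actually 0
    error_rate = (fp + fn) / labels.size
    return error_rate, tp, fp, fn, tn


error_rate, tp, fp, fn, tn = conf_matrix_sketch(
    p=[0.9, 0.8, 0.3, 0.6, 0.1], labels=[1, 1, 1, 0, 0]
)
```

In this toy input two of the five points fall on the wrong side of the threshold, so the error rate is 0.4.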

GPy.util.cluster_with_offset module¶

cluster(data, inputs, verbose=False)[source]¶

Clusters data

Using the new offset model, this method uses a greedy algorithm to cluster the data. It starts with all the data points in separate clusters and tests whether combining them increases the overall log-likelihood (LL). It then iteratively joins pairs of clusters which cause the greatest increase in the LL, until no join increases the LL.

arguments:
    inputs – the ‘X’s in a list, one item per cluster
    data – the ‘Y’s in a list, one item per cluster

returns a list of the clusters.
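The greedy merging loop described above can be sketched independently of the offset model. In this minimal example the per-cluster score `toy_ll` is a hypothetical stand-in for a fitted model's log-likelihood (negative within-cluster squared error minus a fixed per-cluster penalty); `greedy_merge` is likewise an illustrative name, not part of GPy.

```python
from itertools import combinations


def greedy_merge(n_items, cluster_ll):
    """Greedily join the pair of clusters with the largest gain in total score."""
    clusters = [[i] for i in range(n_items)]   # start with one cluster per item
    while len(clusters) > 1:
        best_gain, best_pair = 0.0, None
        for a, b in combinations(range(len(clusters)), 2):
            gain = (cluster_ll(clusters[a] + clusters[b])
                    - cluster_ll(clusters[a]) - cluster_ll(clusters[b]))
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:                  # no join increases the total score
            break
        a, b = best_pair
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters


points = [0.0, 0.1, 0.2, 5.0, 5.1]


def toy_ll(members):
    # Hypothetical score: negative squared error around the mean, minus 1 per cluster.
    values = [points[i] for i in members]
    mean = sum(values) / len(values)
    return -sum((v - mean) ** 2 for v in values) - 1.0


result = greedy_merge(len(points), toy_ll)
```

With these toy points the three values near 0 end up in one cluster and the two values near 5 in another, after which no further join improves the score.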

get_log_likelihood(inputs, data, clust)[source]¶

Get the LL of a combined set of clusters, ignoring time series offsets.

Get the log likelihood of a cluster without accounting for the fact that different time series are offset. It is used here mainly for the cases in which there is only one cluster to compute the log likelihood of.

arguments:
    inputs – the ‘X’s in a list, one item per cluster
    data – the ‘Y’s in a list, one item per cluster
    clust – list of clusters to use

returns a tuple: log likelihood and the offset (which is always zero for this model)

get_log_likelihood_offset(inputs, data, clust)[source]¶

Get the log likelihood of a combined set of clusters, fitting the offsets

arguments:
    inputs – the ‘X’s in a list, one item per cluster
    data – the ‘Y’s in a list, one item per cluster
    clust – list of clusters to use

returns a tuple: log likelihood and the offset

GPy.util.config module¶

GPy.util.datasets module¶

authorize_download(dataset_name=None)[source]¶: Check with the user that they are happy with the terms and conditions for the data set.

boston_housing(data_set='boston_housing')[source]¶

boxjenkins_airline(data_set='boxjenkins_airline', num_train=96)[source]¶

brendan_faces(data_set='brendan_faces')[source]¶

cifar10_patches(data_set='cifar-10')[source]¶: The Canadian Institute for Advanced Research 10 image data set. The code for loading this data is taken from Boris Babenko’s blog post; the original code is available here: http://bbabenko.tumblr.com/post/86756017649/learning-low-level-vision-feautres-in-10-lines-of-code

cmu_mocap(subject, train_motions, test_motions=[], sample_every=4, data_set='cmu_mocap')[source]¶: Load a given subject’s training and test motions from the CMU motion capture data.

cmu_mocap_35_walk_jog(data_set='cmu_mocap')[source]¶: Load CMU subject 35’s walking and jogging motions, the same data that was used by Taylor, Roweis and Hinton at NIPS 2007, but without their preprocessing. Also used by Lawrence at AISTATS 2007.

cmu_mocap_49_balance(data_set='cmu_mocap')[source]¶: Load CMU subject 49’s one-legged balancing motion that was used by Alvarez, Luengo and Lawrence at AISTATS 2009.

cmu_urls_files(subj_motions, messages=True)[source]¶: Find which resources are missing on the local disk for the requested CMU motion capture motions.

creep_data(data_set='creep_rupture')[source]¶: Brun and Yoshida’s metal creep rupture data.

crescent_data(num_data=200, seed=10000)[source]¶

Data set formed from a mixture of four Gaussians. In each class two of the Gaussians are elongated at right angles to each other and offset to form an approximation to the crescent data that is popular in semi-supervised learning as a toy problem.

Parameters:
    num_data (int) – number of data points to be sampled (default is 200).
    seed (int) – random seed to be used for data generation.
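To make the construction concrete, the following sketch draws a crescent-like data set from four Gaussians, two per class, with the two components of each class elongated at right angles to each other. It is not GPy's generator: every mean, covariance, the equal mixing weights and the ±1 label coding are hypothetical values chosen for illustration.

```python
import numpy as np


def crescent_sketch(num_data=200, seed=10000):
    """Sample a two-class, four-Gaussian toy data set (illustrative parameters)."""
    rng = np.random.default_rng(seed)
    horizontal = np.array([[1.0, 0.0], [0.0, 0.05]])  # elongated along x
    vertical = np.array([[0.05, 0.0], [0.0, 1.0]])    # elongated along y
    # (mean, covariance, label) per component; all values are assumptions.
    components = [
        (np.array([-1.0, 1.0]), horizontal, 1),
        (np.array([-2.0, 0.0]), vertical, 1),
        (np.array([1.0, -1.0]), horizontal, -1),
        (np.array([2.0, 0.0]), vertical, -1),
    ]
    counts = rng.multinomial(num_data, [0.25] * 4)    # points per component
    X = np.vstack([
        rng.multivariate_normal(mean, cov, size=n)
        for (mean, cov, _), n in zip(components, counts)
    ])
    Y = np.concatenate([
        np.full(n, label) for (_, _, label), n in zip(components, counts)
    ])
    return X, Y[:, None]


X, Y = crescent_sketch()
```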

data_available(dataset_name=None)[source]¶: Check if the data set is available on the local machine already.

data_details_return(data, data_set)[source]¶: Update the data component of the data dictionary with details drawn from the data_resources.

decampos_digits(data_set='decampos_characters', which_digits=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])[source]¶

della_gatta_TRP63_gene_expression(data_set='della_gatta', gene_number=None)[source]¶

download_data(dataset_name=None)[source]¶: Check with the user that they are happy with the terms and conditions for the data set, then download it.

download_rogers_girolami_data(data_set='rogers_girolami_data')[source]¶

download_url(url, store_directory, save_name=None, messages=True, suffix='')[source]¶: Download a file from a url and save it to disk.

drosophila_knirps(data_set='drosophila_protein')[source]¶

drosophila_protein(data_set='drosophila_protein')[source]¶

football_data(season='1314', data_set='football_data')[source]¶: Football data from English games since 1993. This downloads data from football-data.co.uk for the given season.

fruitfly_tomancak(data_set='fruitfly_tomancak', gene_number=None)[source]¶

global_average_temperature(data_set='global_temperature', num_train=1000, refresh_data=False)[source]¶

google_trends(query_terms=['big data', 'machine learning', 'data science'], data_set='google_trends', refresh_data=False)[source]¶

Data downloaded from Google trends for given query terms.

Warning: if you use this function multiple times in a row, you will get blocked due to terms of service violations. The function caches the result of your query; if you wish to refresh an old query, set refresh_data to True.

The function is inspired by this notebook: http://nbviewer.ipython.org/github/sahuguet/notebooks/blob/master/GoogleTrends%20meet%20Notebook.ipynb

hapmap3(data_set='hapmap3')[source]¶

The HapMap phase three SNP dataset - 1184 samples out of 11 populations.

SNP_matrix (A) encoding [see Paschou et al. 2007 (PCA-Correlated SNPs…)]: Let (B1,B2) be the alphabetically sorted bases which occur in the j-th SNP, then

    Aij =  1, iff SNPij == (B1,B1)
    Aij =  0, iff SNPij == (B1,B2)
    Aij = -1, iff SNPij == (B2,B2)
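The encoding rule above can be sketched for a single SNP column as follows. The helper `encode_snp_column` is hypothetical; it assumes genotypes are given as two-character strings, that exactly two bases occur in the column, and it does not handle missing calls.

```python
import numpy as np


def encode_snp_column(genotypes):
    """Encode one SNP column as 1 / 0 / -1 (hypothetical helper)."""
    bases = sorted({base for genotype in genotypes for base in genotype})
    if len(bases) != 2:
        raise ValueError("this sketch expects exactly two bases per SNP")
    b1, b2 = bases                      # alphabetically sorted bases (B1, B2)
    code = {
        b1 + b1: 1,                     # homozygous in B1
        b1 + b2: 0, b2 + b1: 0,         # heterozygous, either order
        b2 + b2: -1,                    # homozygous in B2
    }
    return np.array([code[genotype] for genotype in genotypes])


encoded = encode_snp_column(["AA", "AG", "GA", "GG", "AA"])
```

For this column the sorted bases are A and G, so the result is `[1, 0, 0, -1, 1]`.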

The SNP data and the meta information (such as iid, sex and phenotype) are stored in the dataframe datadf, index is the Individual ID, with following columns for metainfo:

family_id -> Family ID

paternal_id -> Paternal ID

maternal_id -> Maternal ID

sex -> Sex (1=male; 2=female; other=unknown)

phenotype -> Phenotype (-9, or 0 for unknown)

population -> Population string (e.g. ‘ASW’ - ‘YRI’)

rest are SNP rs (ids)

More information is given in infodf:

Chromosome:

autosomal chromosomes -> 1-22

X X chromosome -> 23

Y Y chromosome -> 24

XY Pseudo-autosomal region of X -> 25

MT Mitochondrial -> 26

Relative Position (to Chromosome) [base pairs]

isomap_faces(num_samples=698, data_set='isomap_face_data')[source]¶

lee_yeast_ChIP(data_set='lee_yeast_ChIP')[source]¶

mauna_loa(data_set='mauna_loa', num_train=545, refresh_data=False)[source]¶

oil(data_set='three_phase_oil_flow')[source]¶: The three phase oil data from Bishop and James (1993).

oil_100(seed=10000, data_set='three_phase_oil_flow')[source]¶

olivetti_faces(data_set='olivetti_faces')[source]¶

olivetti_glasses(data_set='olivetti_glasses', num_training=200, seed=10000)[source]¶

olympic_100m_men(data_set='rogers_girolami_data')[source]¶

olympic_100m_women(data_set='rogers_girolami_data')[source]¶

olympic_200m_men(data_set='rogers_girolami_data')[source]¶

olympic_200m_women(data_set='rogers_girolami_data')[source]¶

olympic_400m_men(data_set='rogers_girolami_data')[source]¶

olympic_400m_women(data_set='rogers_girolami_data')[source]¶

olympic_marathon_men(data_set='olympic_marathon_men')[source]¶

olympic_sprints(data_set='rogers_girolami_data')[source]¶: All Olympic sprint winning times for multiple output prediction.

osu_run1(data_set='osu_run1', sample_every=4)[source]¶

prompt_user(prompt)[source]¶: Ask the user to agree to the data set licenses.

pumadyn(seed=10000, data_set='pumadyn-32nm')[source]¶

reporthook(a, b, c)[source]¶

ripley_synth(data_set='ripley_prnn_data')[source]¶

robot_wireless(data_set='robot_wireless')[source]¶

sample_class(f)[source]¶

silhouette(data_set='ankur_pose_data')[source]¶

simulation_BGPLVM()[source]¶

singlecell(data_set='singlecell')[source]¶

singlecell_rna_seq_deng(dataset='singlecell_deng')[source]¶

singlecell_rna_seq_islam(dataset='singlecell_islam')[source]¶

sod1_mouse(data_set='sod1_mouse')[source]¶

spellman_yeast(data_set='spellman_yeast')[source]¶

spellman_yeast_cdc15(data_set='spellman_yeast')[source]¶

swiss_roll(num_samples=3000, data_set='swiss_roll')[source]¶

swiss_roll_1000()[source]¶

swiss_roll_generated(num_samples=1000, sigma=0.0)[source]¶

toy_linear_1d_classification(seed=10000)[source]¶

toy_rbf_1d(seed=10000, num_samples=500)[source]¶

Samples values of a function from an RBF covariance with very small noise for inputs uniformly distributed between -1 and 1.

Parameters:
    seed (int) – seed to use for random sampling.
    num_samples (int) – number of samples to draw from the function (default 500).
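Sampling a function from an RBF covariance at uniformly distributed inputs can be sketched in NumPy as below. This is not GPy's code: the helper name `toy_rbf_1d_sketch`, the lengthscale and the noise variance are assumed values, and the sample is drawn through a Cholesky factor of the covariance matrix.

```python
import numpy as np


def toy_rbf_1d_sketch(seed=10000, num_samples=500, lengthscale=0.25, noise_var=1e-2):
    """Draw one noisy function sample from an RBF covariance (illustrative values)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(num_samples, 1))  # inputs in [-1, 1]
    X.sort(axis=0)

    sqdist = (X - X.T) ** 2                            # pairwise squared distances
    K = np.exp(-0.5 * sqdist / lengthscale**2)         # RBF covariance
    K += noise_var * np.eye(num_samples)               # small noise term

    L = np.linalg.cholesky(K)
    Y = L @ rng.standard_normal((num_samples, 1))      # Y ~ N(0, K)
    return X, Y


X, Y = toy_rbf_1d_sketch(num_samples=50)
```

Because the generator is seeded, repeated calls with the same arguments return the same inputs and outputs.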

toy_rbf_1d_50(seed=10000)[source]¶

xw_pen(data_set='xw_pen')[source]¶

GPy.util.debug module¶

A module of general debugging tools.

checkFinite(arr, name=None)[source]¶

checkFullRank(m, tol=1e-10, name=None, force_check=False)[source]¶

GPy.util.decorators module¶

silence_errors(f)[source]¶: Wraps a function and silences the NumPy errors raised during its execution. After the function has exited, the previous state of the warnings is restored.
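A decorator with this behaviour can be sketched with NumPy's `errstate` context manager, which restores the previous error-handling settings on exit. The name `silence_errors_sketch` is hypothetical, and GPy's own decorator may be written differently.

```python
import functools

import numpy as np


def silence_errors_sketch(f):
    """Run f with NumPy floating-point errors ignored, then restore the settings."""
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        with np.errstate(all="ignore"):   # previous settings are restored on exit
            return f(*args, **kwargs)
    return wrapper


@silence_errors_sketch
def quiet_log(x):
    return np.log(x)


before = np.geterr()
result = quiet_log(np.array([0.0, 1.0]))  # log(0) would normally emit a warning
after = np.geterr()
```

Comparing `before` and `after` confirms that the global error settings are unchanged once the wrapped function returns.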

GPy.util.diag module¶

add(A, b, offset=0)[source]¶

Add b to the diagonal view of A in place (!). Returns the modified A. Broadcasting is allowed, so b can be a scalar.
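An in-place diagonal update of this kind can be sketched as follows. The sketch uses index arrays rather than a strided view, assumes a square matrix, and assumes that `offset` follows the convention of `np.diagonal` (positive offsets lie above the main diagonal); the name `add_to_diagonal` is hypothetical.

```python
import numpy as np


def add_to_diagonal(A, b, offset=0):
    """Add b to the offset-th diagonal of the square matrix A, in place."""
    n = A.shape[0]
    idx = np.arange(n - abs(offset))
    rows = idx + max(-offset, 0)
    cols = idx + max(offset, 0)
    A[rows, cols] += b        # b may be a scalar or a vector of matching length
    return A


A = np.zeros((3, 3))
returned = add_to_diagonal(A, 1.0)                       # scalar is broadcast
add_to_diagonal(A, np.array([10.0, 20.0]), offset=1)     # first superdiagonal
```

Because the update is in place, `returned` is the same object as `A`; copy the matrix first if the original values are still needed.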