nachos.data package

Submodules

nachos.data.Data module

class nachos.data.Data.Data(id, factors, field_names=None)[source]

Bases: object

Summary:

A structure to store the factors (including those that will be used as constraints) associated with records in a tsv file, dataframe, or lhotse manifest.

__init__(id, factors, field_names=None)[source]
copy()[source]
Return type

Data
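As a rough illustration of the kind of record described above, here is a minimal sketch. The class name RecordSketch, the example field names, and the deep-copy behavior are assumptions made for illustration and are not taken from the nachos source.

```python
from copy import deepcopy
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class RecordSketch:
    """Hypothetical stand-in for Data: an id plus its factor values."""

    id: str
    factors: List[Any]
    field_names: Optional[List[str]] = None

    def copy(self) -> "RecordSketch":
        # Assumption: a deep copy, so mutating the copy leaves the original intact.
        return deepcopy(self)


rec = RecordSketch("utt1", [{"spk1"}, {"room_a"}], ["speaker", "room"])
dup = rec.copy()
dup.factors[0].add("spk2")  # only the copy changes
```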

class nachos.data.Data.Dataset(data, factor_idxs, constraint_idxs)[source]

Bases: object

Summary: A class to store and manipulate the data and their associated factors and constraints. The structure we ultimately want is similar to an inverted index.

factors = [
    {
        factor1_value1: [fid1, fid2, …],
        factor1_value2: [fids, …],
        …
    },
    {
        factor2_value1: […],
        factor2_value2: […],
    },
]
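The inverted-index layout above could be built along the following lines. This is a sketch only: build_inverted_index and the example records are hypothetical names and data, not part of the nachos API.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Tuple


def build_inverted_index(
    records: Sequence[Tuple[str, Sequence[Any]]], num_factors: int
) -> List[Dict[Any, List[str]]]:
    """Map each value of each factor to the ids of the records that have it."""
    index: List[Dict[Any, List[str]]] = [defaultdict(list) for _ in range(num_factors)]
    for record_id, factor_values in records:
        for n, value in enumerate(factor_values):
            index[n][value].append(record_id)
    # Convert to plain dicts so that missing keys raise KeyError.
    return [dict(d) for d in index]


records = [
    ("fid1", ("spk_a", "room_1")),
    ("fid2", ("spk_a", "room_2")),
    ("fid3", ("spk_b", "room_2")),
]
factors = build_inverted_index(records, num_factors=2)
```

Here factors[0] maps each speaker to record ids and factors[1] does the same for rooms.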

__init__(data, factor_idxs, constraint_idxs)[source]
subset_from_data(d)[source]
Summary:

Create a new subset, with the same factors and constraints as self, from a subset of the data points.

Parameters

d (Iterable[Data]) – The data points from which to create a Dataset

Returns

A Dataset object representing the subset of points

Return type

Dataset

subset_from_records(r)[source]
Summary:

Create a new subset, with the same factors and constraints as self, from a subset of the records.

Return type

Dataset

check_complete()[source]
Summary:

Checks if the graph is complete.

Returns

True if complete, False otherwise

Return type

bool

check_disconnected()[source]
Summary:

Checks whether the graph has M > 1 disconnected components.

Returns

True if disconnected, False otherwise

Return type

bool
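The two checks above can be illustrated on a plain adjacency-set graph. This sketch does not use the graph object stored by Dataset; the dictionary representation and the function names are assumptions made for illustration.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]


def is_complete(graph: Graph) -> bool:
    """True if every vertex is adjacent to every other vertex."""
    n = len(graph)
    # Discard self-loops before counting neighbors.
    return all(len(nbrs - {v}) == n - 1 for v, nbrs in graph.items())


def count_components(graph: Graph) -> int:
    """Count connected components with an iterative depth-first search."""
    seen: Set[Hashable] = set()
    components = 0
    for start in graph:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            stack.extend(graph[v] - seen)
    return components


def is_disconnected(graph: Graph) -> bool:
    """True if the graph has more than one connected component."""
    return count_components(graph) > 1


triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
two_parts = {"a": {"b"}, "b": {"a"}, "c": set()}
```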

make_graph(simfuns)[source]
Summary:

Makes the graph representation of the dataset. This assumes that the graph is undirected, an assumption which we may later break, depending on the kinds of similarity functions we will ultimately support. It also makes subgraphs corresponding to each individual factor value. This is like the inverted index: you can look up the neighbors of a factor value. The graph and factors are stored in self.graph and self.graphs respectively.

Parameters

simfuns (nachos.SimilarityFunctions.SimilarityFunctions) – the similarity functions (1 per factor) used to compare records (i.e., data points)

Return type

None
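A minimal sketch of the idea behind make_graph, under stated assumptions: records are (id, factor values) pairs, each similarity function returns a boolean, and two records are joined in the combined graph if any factor is similar. The names make_graphs and set_intersection below are illustrative and do not come from nachos.SimilarityFunctions.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Set, Tuple

Record = Tuple[str, Sequence[frozenset]]
SimFun = Callable[[frozenset, frozenset], bool]


def set_intersection(a: frozenset, b: frozenset) -> bool:
    """Two factor values are similar if they share at least one element."""
    return bool(a & b)


def make_graphs(
    records: Sequence[Record], simfuns: Sequence[SimFun]
) -> Tuple[Dict[str, Set[str]], List[Dict[str, Set[str]]]]:
    ids = [rid for rid, _ in records]
    graph: Dict[str, Set[str]] = {rid: set() for rid in ids}
    factor_graphs: List[Dict[str, Set[str]]] = [
        {rid: set() for rid in ids} for _ in simfuns
    ]
    for (id_u, fac_u), (id_v, fac_v) in combinations(records, 2):
        for n, sim in enumerate(simfuns):
            if sim(fac_u[n], fac_v[n]):
                # Undirected: add the edge in both directions.
                factor_graphs[n][id_u].add(id_v)
                factor_graphs[n][id_v].add(id_u)
                graph[id_u].add(id_v)
                graph[id_v].add(id_u)
    return graph, factor_graphs


records = [
    ("r1", (frozenset({"spk_a"}), frozenset({"room_1"}))),
    ("r2", (frozenset({"spk_a"}), frozenset({"room_2"}))),
    ("r3", (frozenset({"spk_b"}), frozenset({"room_2"}))),
]
graph, factor_graphs = make_graphs(records, [set_intersection, set_intersection])
```

In this toy example r1 and r2 share a speaker and r2 and r3 share a room, so r2 is adjacent to both in the combined graph.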

get_record(i)[source]
Return type

Any

export_graph(filename)[source]
Summary:

Exports the graph to a .gml file, which can in principle be read by graph visualization tools.

Parameters

filename (str) – the filename of the .gml file to create

Returns

None

Return type

None

get_constraints(subset=None, n=None)[source]
Summary:

Returns a generator over the dataset constraints.

Parameters
  • subset (Optional[Iterable]) – Iterable of the subset of ids to use. Defaults to None, which means use all ids.

  • n (Optional[int]) – The constraint index to return. By default it is None, which means to return all the constraints.

Returns

generator over constraints

Return type

Generator

get_factors(subset=None, n=None)[source]
Summary:

Returns a generator over the dataset factors.

Parameters
  • subset (Optional[Iterable]) – Iterable of the subset of ids to use. Defaults to None, which means use all ids.

  • n (Optional[int]) – The factor index to return. By default it is None, which means to return all factors.

Returns

generator over factors

Return type

Generator
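The subset and n parameters of get_factors (and, analogously, get_constraints) could behave roughly as in the sketch below. The function iter_factors and the dictionary layout are assumptions for illustration and are not the library's implementation.

```python
from typing import Any, Dict, Generator, Iterable, Optional, Sequence


def iter_factors(
    data: Dict[str, Sequence[Any]],
    subset: Optional[Iterable[str]] = None,
    n: Optional[int] = None,
) -> Generator[Any, None, None]:
    """Yield the factor values of each record, optionally restricted."""
    # subset=None means iterate over every id.
    ids = data.keys() if subset is None else subset
    for record_id in ids:
        values = data[record_id]
        # n=None means yield all factors; otherwise only the n-th one.
        yield values if n is None else values[n]


data = {
    "fid1": ("spk_a", "room_1"),
    "fid2": ("spk_a", "room_2"),
    "fid3": ("spk_b", "room_2"),
}
speakers = list(iter_factors(data, n=0))
only_fid3 = list(iter_factors(data, subset=["fid3"]))
```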

make_constraint_inverted_index()[source]
Summary:

Sets the inverted index for the constraints. In other words, inverted_index[n] = [value1, value2, …], the set of values seen for the n-th constraint.

Return type

None
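The structure described here, one collection of observed values per constraint, can be sketched as follows. The helper values_per_column and the example rows are hypothetical, and sorting the values is a choice made only to keep the output deterministic.

```python
from typing import Any, List, Sequence


def values_per_column(rows: Sequence[Sequence[Any]], num_columns: int) -> List[List[Any]]:
    """Collect the distinct values seen in each column (one column per constraint)."""
    seen = [set() for _ in range(num_columns)]
    for row in rows:
        for n, value in enumerate(row):
            seen[n].add(value)
    return [sorted(s) for s in seen]


constraint_rows = [(10, "male"), (12, "female"), (10, "female")]
inverted_index = values_per_column(constraint_rows, num_columns=2)
```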

make_factor_inverted_index()[source]
Summary:

Sets the inverted index for the factors. In other words, inverted_index[n] = [value1, value2, …], the set of values seen for the n-th factor. This is not a particularly useful function, as the inverted index computed in this way only works for the set_intersection similarity method. For other types of similarity, such as cosine distance, self.make_graph() will make the graphs corresponding to each factor, and is really a better version of the inverted index created in this function.

This function therefore exists mostly to mirror the make_constraint_inverted_index function.

Return type

None

draw_random_split_from_factor(n)[source]
Summary:

Return a set of Data point ids and its complement corresponding to the inclusion of a subset of values selected from the n-th factor into the “training” set. We also return the index of the set from the powerset of values that resulted in the split.

Parameters

n (int) – the index of the factor in the list self.factor_idxs from which to select

Returns

The tuple of the index of the set from the powerset of values and the datasets corresponding to the random split and its complement resulting from that index

Return type

Tuple[int, Tuple[set, set]]

draw_split_from_factor(n, idx)[source]
Summary:

Like draw_random_split_from_factor, but draws the split specified by an integer index, idx, which selects the subset of values to use from the powerset of values of factor n.

Parameters
  • n (int) – the index of the factor in the list self.factor_idxs from which to select

  • idx (int) – The index in the powerset of the subset of values from the n-th factor to use.

Returns

The tuple of the index of the set from the powerset of values and the datasets corresponding to the split and its complement resulting from that index

Return type

Tuple[int, Tuple[set, set]]
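One way to read "an index into the powerset of values" is as a bitmask over the factor's values, where bit i says whether the i-th value is selected. The sketch below follows that reading. The bitmask convention, the sorted ordering of values, and the function names are assumptions, and the library may enumerate the powerset differently.

```python
import random
from typing import Any, Dict, List, Set, Tuple


def split_from_index(
    value_to_ids: Dict[Any, List[str]], idx: int
) -> Tuple[int, Tuple[Set[str], Set[str]]]:
    """Bit i of idx decides whether the i-th (sorted) value is selected."""
    values = sorted(value_to_ids)
    selected: Set[str] = set()
    everything: Set[str] = set()
    for i, value in enumerate(values):
        everything.update(value_to_ids[value])
        if (idx >> i) & 1:
            selected.update(value_to_ids[value])
    return idx, (selected, everything - selected)


def random_split(
    value_to_ids: Dict[Any, List[str]], rng: random.Random
) -> Tuple[int, Tuple[Set[str], Set[str]]]:
    """Draw a random non-trivial subset of values (needs at least two values)."""
    # Skip 0 (nothing selected) and 2**k - 1 (everything selected).
    idx = rng.randint(1, 2 ** len(value_to_ids) - 2)
    return split_from_index(value_to_ids, idx)


speakers = {"spk_a": ["fid1", "fid2"], "spk_b": ["fid3"], "spk_c": ["fid4"]}
fixed = split_from_index(speakers, 0b101)  # selects spk_a and spk_c
idx_r, (train_r, rest_r) = random_split(speakers, random.Random(0))
```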

draw_random_split()[source]
Summary:

Applies self.draw_random_split_from_factor() to each factor independently, and returns all of the splits.

Returns

The keys (indices into the powersets of values for each factor), and the values (the selected Dataset and its complement) for each factor.

Return type

Tuple[List[int], List[Tuple[Dataset, Dataset]]]

set_random_seed(seed=0)[source]
Summary:

Set the random seed of the random module

Parameters

seed (int) – The seed for the random module. Defaults to 0.

Return type

None
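Seeding Python's random module makes subsequent draws reproducible, which is presumably why this method exists. A short illustration using only the standard library:

```python
import random

random.seed(0)
first = [random.random() for _ in range(3)]

random.seed(0)  # re-seeding with the same value replays the same sequence
second = [random.random() for _ in range(3)]
```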

nearby_splits(idxs, split)[source]
Summary:

Make a generator over “nearby splits”. These are splits that are Hamming distance 1 away from the current split. By this we mean that if you concatenate the bit strings representing the indices into the powersets of values for each factor, a nearby split is any bit string that differs from it in a single bit.

Parameters
  • idxs – The indices into the powersets of the subset corresponding to split

  • split (FactoredSplit) – a split (a factored split actually) around which we want to find splits that are Hamming distance = 1 away

Returns

a generator over the neighboring splits

Return type

Generator[FactoredSplit]
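The Hamming-distance-1 neighborhood can be sketched on the index lists alone. The sketch assumes each factor's powerset index is a bitmask with one bit per value; nearby_indices and num_values are hypothetical names, and mapping the indices back to FactoredSplit objects is omitted.

```python
from typing import Iterator, List, Sequence


def nearby_indices(idxs: Sequence[int], num_values: Sequence[int]) -> Iterator[List[int]]:
    """Yield every index list that differs from idxs in exactly one bit."""
    for n, (idx, width) in enumerate(zip(idxs, num_values)):
        for bit in range(width):
            neighbor = list(idxs)
            neighbor[n] = idx ^ (1 << bit)  # flip one bit of factor n
            yield neighbor


# Factor 0 has 2 values, factor 1 has 3 values: 5 bits, hence 5 neighbors.
neighbors = list(nearby_indices([0b01, 0b110], num_values=[2, 3]))
```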

get_neighborhood(idxs, split, l, max_neighbors=2000)[source]
Summary:

Return a generator over all of the neighbors at distance l from split.

Parameters
  • idxs (List[int]) – The list of indices, one per factor, into the respective powersets of values corresponding to the split

  • split (FactoredSplit) – The split whose neighbors at distance l we want to generate.

  • l (int) – The distance from split of the neighbors we would like to generate

  • max_neighbors (int) – The maximum number of neighbors to explore

Returns

A generator over the neighbors at distance l from split

Return type

Generator[FactoredSplit]
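Generalizing from distance 1 to distance l amounts to flipping l distinct bits of the concatenated bit string. The sketch below enumerates those choices and caps the output at max_neighbors. The bitmask convention and the exhaustive enumeration order are assumptions; the library may explore neighbors differently.

```python
from itertools import combinations, islice
from typing import Iterator, List, Sequence


def neighborhood(
    idxs: Sequence[int],
    num_values: Sequence[int],
    l: int,
    max_neighbors: int = 2000,
) -> Iterator[List[int]]:
    """Yield index lists at Hamming distance exactly l, at most max_neighbors of them."""
    # Every (factor, bit) position of the concatenated bit string.
    positions = [(n, b) for n, width in enumerate(num_values) for b in range(width)]

    def generate() -> Iterator[List[int]]:
        for flips in combinations(positions, l):
            neighbor = list(idxs)
            for n, b in flips:
                neighbor[n] ^= 1 << b
            yield neighbor

    return islice(generate(), max_neighbors)


all_at_2 = list(neighborhood([0b01, 0b110], [2, 3], l=2))
capped = list(neighborhood([0b01, 0b110], [2, 3], l=2, max_neighbors=3))
```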

shake(idxs, split, k)[source]
Summary:

Return a random split from the neighborhood around split.

Parameters
  • idxs – The indices into the powersets of the subsets corresponding to split

  • split (FactoredSplit) – The current split around which we will select a random neighbor

  • k (int) – The distance from split at which the new split, obtained by shaking, will be drawn. Kind of like a shake distance.

Returns

The randomly selected split from the neighborhood of split

Return type

FactoredSplit
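A shake at distance k can be sketched as flipping k randomly chosen bits of the concatenated index bit string. As before, the bitmask convention and the name shake_indices are assumptions, and k must not exceed the total number of bits.

```python
import random
from typing import List, Sequence


def shake_indices(
    idxs: Sequence[int], num_values: Sequence[int], k: int, rng: random.Random
) -> List[int]:
    """Return a random index list at Hamming distance exactly k from idxs."""
    positions = [(n, b) for n, width in enumerate(num_values) for b in range(width)]
    neighbor = list(idxs)
    # Sampling without replacement guarantees k distinct bits are flipped.
    for n, b in rng.sample(positions, k):
        neighbor[n] ^= 1 << b
    return neighbor


shaken = shake_indices([0b01, 0b110], [2, 3], k=2, rng=random.Random(0))
```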

draw_random_node_cut()[source]
Summary:

Draw random, non-adjacent vertices as source and target nodes, and compute the minimum st-vertex cut. This cut may result in > 2 components. In this case, randomly assign the components to different splits.

Returns

the split of components

Return type

Split
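The sketch below covers only the second half of this procedure: once a vertex cut is known, find the remaining connected components and assign each one at random to one of two splits. Computing the minimum st-vertex cut itself is left out, the cut vertices are simply dropped (an assumption), and all names are hypothetical.

```python
import random
from typing import Dict, Hashable, List, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def components_without(graph: Graph, removed: Set[Hashable]) -> List[Set[Hashable]]:
    """Connected components of the graph after deleting the removed vertices."""
    seen = set(removed)
    components: List[Set[Hashable]] = []
    for start in graph:
        if start in seen:
            continue
        component: Set[Hashable] = set()
        stack = [start]
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            component.add(v)
            stack.extend(graph[v] - seen)
        components.append(component)
    return components


def assign_components(
    components: List[Set[Hashable]], rng: random.Random
) -> Tuple[Set[Hashable], Set[Hashable]]:
    """Randomly place each whole component in one of two splits."""
    first: Set[Hashable] = set()
    second: Set[Hashable] = set()
    for component in components:
        (first if rng.random() < 0.5 else second).update(component)
    return first, second


# Path a-b-c-d-e; removing the cut vertex c leaves {a, b} and {d, e}.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
parts = components_without(path, removed={"c"})
first, second = assign_components(parts, random.Random(0))
```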

make_overlapping_test_sets(split)[source]
Summary:

Takes a split of the dataset, i.e., two subsets of the dataset that do not overlap in the specified factors, and from the remaining data in the dataset not included in the split, creates multiple test sets that have some overlap with respect to one or more factors in the first of the two subsets in split.

In general, there are 2^N different kinds of overlap when using N factors. By overlap, we mean factors that are considered similar under the similarity function used to create the graph. We can use the factor-specific graphs for this purpose.

Parameters

split (Split) – The split (i.e., two subsets of the dataset) with no factor overlap, with respect to which we are making the additional test sets.

Returns

Test sets

Return type

List[set]
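The 2^N kinds of overlap can be made concrete by tagging each held-out record with one boolean per factor, indicating whether that factor overlaps with the training subset. The sketch below uses exact equality of factor values as the similarity criterion, which is an assumption; group_by_overlap and the example data are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Sequence, Set, Tuple


def group_by_overlap(
    train: Dict[str, Sequence[Any]], remaining: Dict[str, Sequence[Any]]
) -> Dict[Tuple[bool, ...], Set[str]]:
    """Group held-out record ids by which factors they share with train."""
    num_factors = len(next(iter(train.values())))
    train_values = [{vals[n] for vals in train.values()} for n in range(num_factors)]
    groups: Dict[Tuple[bool, ...], Set[str]] = defaultdict(set)
    for record_id, vals in remaining.items():
        # One boolean per factor: does this record's value appear in train?
        pattern = tuple(vals[n] in train_values[n] for n in range(num_factors))
        groups[pattern].add(record_id)
    return dict(groups)


train = {"t1": ("spk_a", "room_1"), "t2": ("spk_b", "room_1")}
remaining = {
    "r1": ("spk_a", "room_2"),
    "r2": ("spk_c", "room_1"),
    "r3": ("spk_a", "room_1"),
    "r4": ("spk_c", "room_2"),
}
test_sets = group_by_overlap(train, remaining)
```

With N = 2 factors there are four possible patterns, and this toy data populates all of them.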

overlap_stats(s1, s2)[source]
Summary:

Compute the overlap “stats” of s2 with respect to s1 for each factor.

Parameters
  • s1 (set) – The set with respect to which overlap will be computed

  • s2 (set) – the set whose overlap is computed with respect to s1

Returns

The dictionary of factor overlaps (s2 with respect to s1)

Return type

dict
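The exact statistics returned are not specified here. As one plausible example, the sketch below reports, for each factor, the fraction of records in s2 whose value also occurs in s1. The statistic, the function name overlap_fractions, and the data layout are all assumptions.

```python
from typing import Any, Dict, Sequence, Set


def overlap_fractions(
    data: Dict[str, Sequence[Any]],
    s1: Set[str],
    s2: Set[str],
    field_names: Sequence[str],
) -> Dict[str, float]:
    """Per factor, the fraction of s2 records whose value also occurs in s1."""
    stats: Dict[str, float] = {}
    for n, name in enumerate(field_names):
        s1_values = {data[i][n] for i in s1}
        shared = sum(1 for i in s2 if data[i][n] in s1_values)
        stats[name] = shared / len(s2) if s2 else 0.0
    return stats


data = {
    "a": ("spk_a", "room_1"),
    "b": ("spk_b", "room_1"),
    "c": ("spk_a", "room_2"),
    "d": ("spk_c", "room_2"),
}
stats = overlap_fractions(data, s1={"a", "b"}, s2={"c", "d"}, field_names=["speaker", "room"])
```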

nachos.data.Data.collapse_factored_split(split)[source]
Summary:

Take a FactoredSplit and collapse it by intersecting all of the selected sets, and intersecting all of their complements, to create a single selected set and a single other split with no overlap in any of the factors present in the selected set.

Parameters

split (FactoredSplit) – The split to collapse

Returns

the collapsed split

Return type

Split
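The collapse described above can be sketched with plain sets, assuming a factored split is a non-empty list of (selected, complement) pairs, one per factor. The representation and the name collapse are assumptions made for illustration.

```python
from typing import List, Set, Tuple


def collapse(factored_split: List[Tuple[Set[str], Set[str]]]) -> Tuple[Set[str], Set[str]]:
    """Intersect all selected sets, and separately all complements."""
    selected = set.intersection(*[s for s, _ in factored_split])
    other = set.intersection(*[c for _, c in factored_split])
    return selected, other


factored = [
    ({"a", "b", "c"}, {"d", "e"}),  # split induced by factor 0
    ({"a", "b", "d"}, {"c", "e"}),  # split induced by factor 1
]
selected, other = collapse(factored)
```

Records c and d end up in neither set because they fall on different sides for different factors.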

nachos.data.Input module

class nachos.data.Input.TSVLoader[source]

Bases: object

static load(fname, config)[source]
class nachos.data.Input.PandasLoader[source]

Bases: object

__init__()[source]
class nachos.data.Input.LhotseLoader[source]

Bases: object

__init__()[source]

Module contents