nachos.data package

Submodules

nachos.data.Data module

class nachos.data.Data.Data(id, factors, field_names=None)[source]

Bases: object

Summary:: A structure to store the factors (including those that will be used as constraints) associated with records in a tsv file, dataframe, or lhotse manifest.

__init__(id, factors, field_names=None)[source]

copy()[source]

Return type: Data

class nachos.data.Data.Dataset(data, factor_idxs, constraint_idxs)[source]

Bases: object

Summary: A class to store and manipulate the data and their associated factors and constraints. The structure we ultimately want is similar to an inverted index.

factors = [
    {
        factor1_value1: [fid1, fid2, …],
        factor1_value2: [fids, …],
        …
    },
    {
        factor2_value1: […],
        factor2_value2: […],
    },
]
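
The inverted-index layout above can be sketched in a few lines. This is a minimal illustration, not the nachos implementation: the `records` dictionary, the `build_inverted_index` helper, and the assumption that each factor of a record holds a set of values are all hypothetical.

```python
from collections import defaultdict

def build_inverted_index(records, num_factors):
    """Map each factor value to the ids of the records that contain it.

    `records` is a hypothetical dict: record id -> list of per-factor value sets.
    """
    index = [defaultdict(list) for _ in range(num_factors)]
    for rec_id, factors in records.items():
        for n, values in enumerate(factors):
            for value in values:
                index[n][value].append(rec_id)
    # One plain dict per factor, mirroring the `factors = [...]` layout.
    return [dict(d) for d in index]

records = {
    "fid1": [{"spk_a"}, {"room_1"}],
    "fid2": [{"spk_a"}, {"room_2"}],
    "fid3": [{"spk_b"}, {"room_2"}],
}
factors = build_inverted_index(records, num_factors=2)
```

Here `factors[0]` indexes the first factor (speaker, in this made-up example) and `factors[1]` the second, so looking up a value returns every record id that carries it.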

__init__(data, factor_idxs, constraint_idxs)[source]

subset_from_data(d)[source]

Summary:: Create a new subset, with the same factors and constraints as self, from a subset of the data points.

Parameters: d (Iterable[Data]) – The data points from which to create a Dataset

Returns: A Dataset object representing the subset of points
Return type: Dataset

subset_from_records(r)[source]

Summary:: Create a new subset, with the same factors and constraints as self, from a subset of the records.

Return type: Dataset

check_complete()[source]

Summary:: Checks whether the graph is complete.

Returns: True if complete, False otherwise
Return type: bool

check_disconnected()[source]

Summary:: Checks whether the graph has M > 1 disconnected components.

Returns: True if disconnected, False otherwise
Return type: bool
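
The two graph checks above can be illustrated with a small standard-library sketch. It assumes the graph is available as an adjacency dictionary (node to set of neighbors); the helper names `connected_components`, `is_complete`, and `is_disconnected` are hypothetical and do not reflect how nachos stores or inspects its graph.

```python
def connected_components(adj):
    """Return the connected components of an undirected graph.

    `adj` is a hypothetical adjacency dict: node -> set of neighboring nodes.
    """
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        components.append(component)
    return components

def is_complete(adj):
    # Complete: every node is adjacent to all of the other nodes.
    n = len(adj)
    return all(len(nbrs - {node}) == n - 1 for node, nbrs in adj.items())

def is_disconnected(adj):
    # Disconnected: more than one connected component.
    return len(connected_components(adj)) > 1

triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
two_parts = {"a": {"b"}, "b": {"a"}, "c": set()}
```

A complete graph admits no split without factor overlap, while a disconnected graph already contains at least one.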

make_graph(simfuns)[source]

Summary:: Makes the graph representation of the dataset. This assumes that the graph is undirected, an assumption which we may later break, depending on the kinds of similarity functions we will ultimately support. It also makes subgraphs corresponding to each individual factor, which act like the inverted index: you can look up the neighbors of a record with respect to a given factor. The graph and the per-factor subgraphs are stored in self.graph and self.graphs, respectively.

Parameters: simfuns (nachos.SimilarityFunctions.SimilarityFunctions) – the similarity functions (1 per factor) used to compare records (i.e., data points)

Return type: None
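
As a rough illustration of the graph construction described above, the following sketch builds an undirected graph and one subgraph per factor from plain dictionaries. Everything here is an assumption made for the example: the `records` layout, the `set_intersection` similarity function, the `build_graphs` helper, and the rule that two records are connected when any single factor is judged similar.

```python
from itertools import combinations

def set_intersection(u, v):
    """Hypothetical similarity function: True when two value sets share a value."""
    return len(u & v) > 0

def build_graphs(records, simfuns):
    """Build an undirected graph plus one subgraph per factor.

    `records` maps record id -> list of per-factor value sets (an assumption).
    """
    graph = {rec_id: set() for rec_id in records}
    factor_graphs = [{rec_id: set() for rec_id in records} for _ in simfuns]
    for a, b in combinations(records, 2):
        for n, sim in enumerate(simfuns):
            if sim(records[a][n], records[b][n]):
                # Undirected: store the edge in both directions.
                factor_graphs[n][a].add(b)
                factor_graphs[n][b].add(a)
                graph[a].add(b)
                graph[b].add(a)
    return graph, factor_graphs

records = {
    "fid1": [{"spk_a"}, {"room_1"}],
    "fid2": [{"spk_a"}, {"room_2"}],
    "fid3": [{"spk_b"}, {"room_2"}],
}
graph, factor_graphs = build_graphs(records, [set_intersection, set_intersection])
```

The per-factor dictionaries play the role of the inverted index: `factor_graphs[n][rec_id]` lists the records that are similar to `rec_id` with respect to factor `n` alone.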

get_record(i)[source]

Return type: Any

export_graph(filename)[source]

Summary:: Exports the graph to a .gml file, which can in principle be read by graph visualization tools.

Parameters: filename (str) – the filename of the .gml file to create

Returns: None
Return type: None

get_constraints(subset=None, n=None)[source]

Summary:: Returns a generator over the dataset constraints.

Parameters

subset (Optional[Iterable]) – Iterable of the subset of ids to use. Defaults to None, which means use all ids.
n (Optional[int]) – The constraint index to return. By default it is None, which means to return all the constraints.

Returns: generator over constraints
Return type: Generator

get_factors(subset=None, n=None)[source]

Summary:: Returns a generator over the dataset factors.

Parameters

subset (Optional[Iterable]) – Iterable of the subset of ids to use. Defaults to None, which means use all ids.
n (Optional[int]) – The factor index to return. By default it is None, which means to return all factors.

Returns: generator over factors
Return type: Generator

make_constraint_inverted_index()[source]

Summary:: Sets the inverted index for the constraints. In other words, inverted_index[n] = [value1, value2, …], the set of values seen for the n-th constraint.

Return type: None

make_factor_inverted_index()[source]

Summary:

Sets the inverted index for the factors. In other words, inverted_index[n] = [value1, value2, …], the set of values seen for the n-th factor. This is not a particularly useful function, as the inverted index computed in this way only works for the set_intersection similarity method. For other types of similarity, such as cosine distance, self.make_graph() will make the graphs corresponding to each factor, which are a better version of the inverted index created in this function.

This function therefore exists mostly to mirror the make_constraint_inverted_index function.

Return type: None

draw_random_split_from_factor(n)[source]

Summary:: Return a set of Data point ids and its complement corresponding to the inclusion of a subset of values selected from the n-th factor into the “training” set. We also return the index of the set from the powerset of values that resulted in the split.

Parameters: n (int) – the index of the factor in the list self.factor_idxs from which to select

Returns: The tuple of the index of the set from the powerset of values and the datasets corresponding to the random split and its complement resulting from that index
Return type: Tuple[int, Tuple[set, set]]

draw_split_from_factor(n, idx)[source]

Summary:: Like draw_random_split_from_factor, but draws the split specified by an integer index, idx, which selects the subset of values to use from the powerset of values of factor n.

Parameters

n (int) – the index of the factor in the list self.factor_idxs from which to select
idx (int) – The index in the powerset of the subset of values from the n-th factor to use.

Returns: The tuple of the index of the set from the powerset of values and the datasets corresponding to the split and its complement resulting from that index
Return type: Tuple[int, Tuple[set, set]]
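
One way to picture an index into the powerset of values is as a bitmask, where each bit says whether a value is included. The sketch below follows that reading; it is an assumption about the encoding, not a description of the nachos internals, and the `split_from_index` helper and the `speakers` mapping are made up for the example.

```python
def split_from_index(inverted_index, idx):
    """Select the subset of factor values encoded by the bits of `idx`.

    `inverted_index` is a hypothetical dict: factor value -> set of record ids.
    Bit i of `idx` decides whether the i-th value (in sorted order) is selected.
    """
    values = sorted(inverted_index)
    chosen = {v for i, v in enumerate(values) if (idx >> i) & 1}
    selected = set().union(*(inverted_index[v] for v in chosen))
    all_ids = set().union(*inverted_index.values())
    # Return the index together with the selected ids and their complement.
    return idx, (selected, all_ids - selected)

speakers = {"spk_a": {"fid1", "fid2"}, "spk_b": {"fid3"}, "spk_c": {"fid4"}}
key, (train, rest) = split_from_index(speakers, idx=0b101)
```

With three values there are 2^3 possible indices; `0b101` selects the first and third values in sorted order.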

draw_random_split()[source]

Summary:: Applies self.draw_random_split_from_factor() to each factor independently, and returns all of the splits.

Returns: The keys (indices into the powersets of values for each factor), and the values (the selected Dataset and its complement) for each factor.
Return type: Tuple[List[int], List[Tuple[Dataset, Dataset]]]

set_random_seed(seed=0)[source]

Summary:: Set the random seed of the random module

Parameters: seed (int) – The seed passed to the random module. Defaults to 0.
Return type: None

nearby_splits(idxs, split)[source]

Summary:: Make a generator over “nearby splits”. These are splits that are Hamming distance 1 away from the current split. By this we mean that if you concatenate the bit strings representing the indices into the powersets of values for each factor, a nearby split is any bit string that differs from the current one in a single bit.

Parameters

idxs – The indices into the powersets of the subset corresponding to split
split (FactoredSplit) – a split (a factored split actually) around which we want to find splits that are Hamming distance = 1 away

Returns: a generator over the neighboring splits
Return type: Generator[FactoredSplit]
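
The Hamming-distance-1 neighborhood described above can be sketched on the index lists alone. This is an illustration under assumptions: `nearby_indices` and `num_values` are hypothetical names, each powerset index is treated as a bit string with one bit per factor value, and the sketch yields index lists rather than FactoredSplit objects.

```python
def nearby_indices(idxs, num_values):
    """Yield every index list at Hamming distance 1 from `idxs`.

    `num_values[n]` is the number of distinct values of factor n, so the
    powerset index of factor n is a bit string of that length (an assumption).
    """
    for n, width in enumerate(num_values):
        for bit in range(width):
            neighbor = list(idxs)
            neighbor[n] ^= 1 << bit  # flip exactly one bit
            yield neighbor

neighbors = list(nearby_indices([0b101, 0b01], num_values=[3, 2]))
```

The number of neighbors equals the total number of bits, here 3 + 2 = 5, since each neighbor adds or removes exactly one value of one factor.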

get_neighborhood(idxs, split, l, max_neighbors=2000)[source]

Summary:: Return a generator over all of the neighbors at distance l from split.

Parameters

idxs (List[int]) – The list of indices, one per factor, into the respective powersets of values corresponding to split
split (FactoredSplit) – The split whose neighbors at distance l we want to generate.
l (int) – The distance from split of the neighbors we would like to generate
max_neighbors (int) – The maximum number of neighbors to explore

Returns: A generator over the neighbors at distance l from split
Return type: Generator[FactoredSplit]
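
Generalizing from distance 1 to distance l amounts to flipping l distinct bits of the concatenated bit string. The sketch below does this with `itertools.combinations` and caps the output with `itertools.islice`. The `neighborhood_indices` helper is hypothetical, it works on index lists rather than FactoredSplit objects, and treating max_neighbors as a simple cap on the number of generated neighbors is an assumption.

```python
from itertools import combinations, islice

def neighborhood_indices(idxs, num_values, l, max_neighbors=2000):
    """Yield index lists at Hamming distance exactly `l` from `idxs`."""
    # Address every bit of the concatenated bit string as (factor, bit).
    positions = [(n, b) for n, width in enumerate(num_values) for b in range(width)]

    def generate():
        for flips in combinations(positions, l):
            neighbor = list(idxs)
            for n, b in flips:
                neighbor[n] ^= 1 << b
            yield neighbor

    # Stop after at most `max_neighbors` neighbors.
    return islice(generate(), max_neighbors)

at_distance_2 = list(neighborhood_indices([0b101, 0b01], num_values=[3, 2], l=2))
```

The neighborhood grows combinatorially with l (here 5 choose 2 candidates), which is why a cap on the number of explored neighbors is useful.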

shake(idxs, split, k)[source]

Summary:: Return a random split from the neighborhood around split.

Parameters

idx – The index o
split (FactoredSplit) – The current split around which we will select a random neighbor
k (int) – The distance from split at which the new split, obtained by shaking, will be drawn; in effect, a shake distance

Returns: The randomly selected split from the neighborhood of split
Return type: FactoredSplit

draw_random_node_cut()[source]

Summary:: Draw random, non-adjacent vertices as source and target nodes, and compute the minimum st-vertex cut. This cut may result in > 2 components. In this case, randomly assign the components to different splits.

Returns: the split of components
Return type: Split

make_overlapping_test_sets(split)[source]

Summary:

Takes a split of the dataset, i.e., two subsets of the dataset that do not overlap in the specified factors, and from the remaining data in the dataset not included in the split, creates multiple test sets that have some overlap with respect to one or more factors in the first of the two subsets in split.

In general, there are 2^N different kinds of overlap when using N factors. By overlap, we mean factors that are considered to match under the similarity function used to create the graph. We can use the factor-specific graphs for this purpose.

Parameters: split (Split) – The split (i.e., two subsets of the dataset) with no factor overlap, with respect to which we are making the additional test sets.

Returns: Test sets
Return type: List[set]
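
To make the 2^N kinds of overlap concrete, the sketch below labels each held-out record with a tuple of booleans, one per factor, indicating whether it overlaps with the training subset in that factor. It is only an illustration: `group_by_overlap` and the `records` layout are hypothetical, and overlap is approximated by shared values rather than by the similarity functions used to build the graph.

```python
from collections import defaultdict

def group_by_overlap(records, train_ids, held_out_ids):
    """Group held-out records by which factors overlap with the training set.

    `records` maps record id -> list of per-factor value sets (an assumption).
    Each key is a tuple of booleans, one per factor, so N factors give up to
    2^N distinct overlap patterns.
    """
    num_factors = len(next(iter(records.values())))
    train_values = [set() for _ in range(num_factors)]
    for rec_id in train_ids:
        for n, values in enumerate(records[rec_id]):
            train_values[n] |= values
    groups = defaultdict(set)
    for rec_id in held_out_ids:
        pattern = tuple(
            bool(records[rec_id][n] & train_values[n]) for n in range(num_factors)
        )
        groups[pattern].add(rec_id)
    return dict(groups)

records = {
    "t1": [{"spk_a"}, {"room_1"}],
    "h1": [{"spk_a"}, {"room_2"}],
    "h2": [{"spk_b"}, {"room_1"}],
    "h3": [{"spk_b"}, {"room_2"}],
}
groups = group_by_overlap(records, train_ids={"t1"}, held_out_ids={"h1", "h2", "h3"})
```

Each group other than the all-False pattern is a candidate test set that overlaps with the training subset in a specific combination of factors.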

overlap_stats(s1, s2)[source]

Summary:: Compute the overlap statistics of s2 with respect to s1 for each factor.

Parameters

s1 (set) – The set with respect to which overlap will be computed
s2 (set) – the set whose overlap is computed with respect to s1

Returns: The dictionary of per-factor overlaps (s2 with respect to s1)
Return type: dict
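
The exact statistics returned are not specified here, so the following is only one plausible reading: for each factor, the fraction of records in s2 that share at least one value with the records in s1. The `overlap_fractions` helper, the `records` layout, and the choice of statistic are all assumptions made for illustration.

```python
def overlap_fractions(records, s1, s2, num_factors):
    """Per factor, the fraction of records in s2 sharing a value with s1.

    `records` maps record id -> list of per-factor value sets (an assumption).
    """
    stats = {}
    for n in range(num_factors):
        s1_values = set().union(*(records[i][n] for i in s1))
        shared = sum(1 for i in s2 if records[i][n] & s1_values)
        stats[n] = shared / len(s2) if s2 else 0.0
    return stats

records = {
    "fid1": [{"spk_a"}, {"room_1"}],
    "fid2": [{"spk_a"}, {"room_2"}],
    "fid3": [{"spk_b"}, {"room_2"}],
}
stats = overlap_fractions(records, s1={"fid1"}, s2={"fid2", "fid3"}, num_factors=2)
```

The result is keyed by factor index, and the measure is asymmetric: swapping s1 and s2 generally changes the values.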

nachos.data.Data.collapse_factored_split(split)[source]

Summary:: Take a FactoredSplit and collapse it by intersecting all of the selected sets, and intersecting all of their complements, to create a single selected set and a single other set that has no overlap with the selected set in any of the factors.

Parameters: split (FactoredSplit) – The split to collapse

Returns: the collapsed split
Return type: Split
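
The collapse operation can be sketched with plain Python sets. The sketch assumes a factored split is a list of (selected, complement) pairs, one per factor; the `collapse` helper and the example ids are hypothetical.

```python
from functools import reduce

def collapse(factored_split):
    """Intersect the selected sets and, separately, their complements.

    `factored_split` is assumed to be a list of (selected, complement) pairs,
    one pair per factor.
    """
    selected = reduce(set.intersection, (s for s, _ in factored_split))
    other = reduce(set.intersection, (c for _, c in factored_split))
    return selected, other

factored = [
    ({"fid1", "fid2"}, {"fid3", "fid4"}),  # split induced by factor 0
    ({"fid1", "fid3"}, {"fid2", "fid4"}),  # split induced by factor 1
]
train, test = collapse(factored)
```

Records that fall on different sides for different factors (here fid2 and fid3) end up in neither set, which is the price of guaranteeing no overlap in any factor.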

nachos.data.Input module

class nachos.data.Input.TSVLoader[source]

Bases: object

static load(fname, config)[source]

class nachos.data.Input.PandasLoader[source]

Bases: object

__init__()[source]

class nachos.data.Input.LhotseLoader[source]

Bases: object

__init__()[source]
