nachos.data package
Submodules
nachos.data.Data module
- class nachos.data.Data.Data(id, factors, field_names=None)[source]
Bases:
object
- Summary:
A structure to store the factors (including those that will be used as constraints) associated with records in a tsv file, dataframe, or lhotse manifest.
- class nachos.data.Data.Dataset(data, factor_idxs, constraint_idxs)[source]
Bases:
object
Summary: A class to store and manipulate the data and their associated factors and constraints. The structure we ultimately want is similar to an inverted index.
- factors = [
{factor1_value1: [fid1, fid2, …], factor1_value2: [fids, …], …},
{factor2_value1: […], factor2_value2: […], …},
]
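The structure above can be sketched in plain Python. This is a minimal illustration, not the Dataset implementation: the record layout (an id plus one set of values per factor), the sample values, and the helper name build_inverted_index are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical records: (record id, one set of values per factor).
records = [
    ("fid1", [{"spk_a"}, {"rec_1"}]),
    ("fid2", [{"spk_a"}, {"rec_2"}]),
    ("fid3", [{"spk_b"}, {"rec_2"}]),
]


def build_inverted_index(records, num_factors):
    """Return one {value: [record ids]} mapping per factor (hypothetical helper)."""
    index = [defaultdict(list) for _ in range(num_factors)]
    for fid, factor_values in records:
        for n, values in enumerate(factor_values):
            for value in values:
                index[n][value].append(fid)
    # Convert to plain dicts so that missing keys raise instead of growing the index.
    return [dict(mapping) for mapping in index]


factors = build_inverted_index(records, num_factors=2)
```

Looking up factors[0]["spk_a"] then gives every record id that shares that value of the first factor.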
- subset_from_data(d)[source]
- Summary:
Create a new subset, with the same factors and constraints as self, from a subset of the data points.
- Parameters
d (Iterable[Data]) – The data points from which to create a Dataset
- Returns
A Dataset object representing the subset of points
- Return type
- subset_from_records(r)[source]
- Summary:
Create a new subset, with the same factors and constraints as self, from a subset of the data points.
- Return type
- check_complete()[source]
- Summary:
Checks whether the graph is complete.
- Returns
True if complete, False otherwise
- Return type
bool
- check_disconnected()[source]
- Summary:
Checks whether the graph has M > 1 disconnected components.
- Returns
True if disconnected, False otherwise
- Return type
bool
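As a rough illustration of what the completeness and connectivity checks test, the sketch below works on a plain adjacency mapping. The adjacency-dict representation and the function names are assumptions for this example; the library stores its graph in self.graph and may compute these properties differently.

```python
from collections import deque


def is_complete(adj):
    """True if every vertex is adjacent to every other vertex."""
    n = len(adj)
    return all(len(neighbors - {v}) == n - 1 for v, neighbors in adj.items())


def count_components(adj):
    """Count connected components of an undirected graph with breadth-first search."""
    seen = set()
    components = 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    queue.append(u)
    return components


def is_disconnected(adj):
    """True if the graph has more than one connected component."""
    return count_components(adj) > 1


# Hypothetical graphs: a triangle, and a graph with an isolated vertex "c".
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
with_isolated = {"a": {"b"}, "b": {"a"}, "c": set()}
```

A complete graph cannot be split without overlap, while a disconnected graph already contains at least one overlap-free split.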
- make_graph(simfuns)[source]
- Summary:
Makes the graph representation of the dataset. This assumes that the graph is undirected, an assumption which we may later break, depending on the kinds of similarity functions we will ultimately support. It also makes subgraphs corresponding to each individual factor value. This is like the inverted index: you can look up the neighbors of a factor. The graph and the factor subgraphs are stored in self.graph and self.graphs respectively.
- Parameters
simfuns (nachos.SimilarityFunctions.SimilarityFunctions) – the similarity functions (1 per factor) used to compare records (i.e., data points)
- Return type
None
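The idea of one undirected graph plus per-factor subgraphs can be sketched as follows. This is a simplified illustration under stated assumptions: records are a dict of id to per-factor value sets, each similarity function is a plain callable returning True or False, and the helper names are hypothetical. It is not the nachos.SimilarityFunctions interface.

```python
from itertools import combinations


def set_intersection(a, b):
    """Two records are similar in a factor if their value sets intersect."""
    return len(a & b) > 0


def make_graphs(records, simfuns):
    """Return (graph, graphs): the overall adjacency map and one map per factor."""
    ids = list(records)
    graph = {i: set() for i in ids}
    graphs = [{i: set() for i in ids} for _ in simfuns]
    for u, v in combinations(ids, 2):
        for n, sim in enumerate(simfuns):
            if sim(records[u][n], records[v][n]):
                # Undirected edge in the factor graph and in the overall graph.
                graphs[n][u].add(v)
                graphs[n][v].add(u)
                graph[u].add(v)
                graph[v].add(u)
    return graph, graphs


# Hypothetical records with two factors each.
records = {
    "fid1": [{"spk_a"}, {"rec_1"}],
    "fid2": [{"spk_a"}, {"rec_2"}],
    "fid3": [{"spk_b"}, {"rec_2"}],
}
graph, graphs = make_graphs(records, [set_intersection, set_intersection])
```

Here fid1 and fid2 are linked through the first factor and fid2 and fid3 through the second, so the overall graph is connected even though no single factor links all three records.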
- export_graph(filename)[source]
- Summary:
Exports the graph to a .gml file, which can in principle be read by graph visualization tools.
- Parameters
filename (str) – the filename of the .gml file to create
- Returns
None
- Return type
None
- get_constraints(subset=None, n=None)[source]
- Summary:
Returns a generator over the dataset constraints.
- Parameters
subset (Optional[Iterable]) – Iterable of the subset of ids to use. The default is None, which means use all ids.
n (Optional[int]) – The constraint index to return. By default it is None, which means to return all the constraints.
- Returns
generator over constraints
- Return type
Generator
- get_factors(subset=None, n=None)[source]
- Summary:
Returns a generator over the dataset factors.
- Parameters
subset (Optional[Iterable]) – Iterable of the subset of ids to use. The default is None, which means use all ids.
n (Optional[int]) – The factor index to return. By default it is None, which means to return all factors.
- Returns
generator over factors
- Return type
Generator
- make_constraint_inverted_index()[source]
- Summary:
Sets the inverted index for the constraints. In other words, inverted_index[n] = [value1, value2, …], the set of values seen for the n-th constraint.
- Return type
None
- make_factor_inverted_index()[source]
- Summary:
Sets the inverted index for the factors. In other words, inverted_index[n] = [value1, value2, …], the set of values seen for the n-th factor. This is not a particularly useful function, as the inverted index computed in this way only works for the set_intersection similarity method. For other types of similarity, such as cosine distance, self.make_graph() will make the graphs corresponding to each factor, and is really a better version of the inverted index created in this function.
This function therefore exists mostly to mirror the make_constraint_inverted_index function.
- Return type
None
- draw_random_split_from_factor(n)[source]
- Summary:
Return a set of Data point ids and its complement corresponding to the inclusion of a subset of values selected from the n-th factor into the “training” set. We also return the index of the set from the powerset of values that resulted in the split.
- Parameters
n (int) – the index of the factor in the list self.factor_idxs from which to select
- Returns
The tuple of the index of the set from the powerset of values and the datasets corresponding to the random split and its complement resulting from that index
- Return type
Tuple[int, Tuple[set, set]]
- draw_split_from_factor(n, idx)[source]
- Summary:
Like draw_random_split_from_factor, but draws the split specified by an integer index, idx, which identifies the subset of values, from the powerset of values of factor n, to use.
- Parameters
n (int) – the index of the factor in the list self.factor_idxs from which to select
idx (int) – The index in the powerset of the subset of values from the n-th factor to use.
- Returns
The tuple of the index of the set from the powerset of values and the datasets corresponding to the split and its complement resulting from that index
- Return type
Tuple[int, Tuple[set, set]]
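One way to picture an index into the powerset of values is as a bitmask over an ordered list of the values. The sketch below assumes exactly that encoding (bit b of idx selects the b-th value in sorted order) and assumes the complement is simply every id not selected; both are assumptions for illustration, and the function names are hypothetical rather than the library's own.

```python
import random


def split_from_index(inverted_index, idx):
    """Split record ids using idx as a bitmask over the sorted factor values."""
    values = sorted(inverted_index)
    chosen = {v for bit, v in enumerate(values) if (idx >> bit) & 1}
    all_ids = set().union(*inverted_index.values())
    selected = set().union(*(inverted_index[v] for v in chosen))
    return idx, (selected, all_ids - selected)


def random_split(inverted_index, rng):
    """Draw a random non-empty, non-full subset of values (needs >= 2 values)."""
    num_values = len(inverted_index)
    # Index 0 is the empty set and 2**num_values - 1 is the full set; skip both.
    idx = rng.randrange(1, 2**num_values - 1)
    return split_from_index(inverted_index, idx)


# Hypothetical inverted index for one factor: value -> record ids.
speaker_index = {"spk_a": ["fid1", "fid2"], "spk_b": ["fid3"], "spk_c": ["fid4"]}

# 0b101 selects the first and third values in sorted order: spk_a and spk_c.
idx, (selected, complement) = split_from_index(speaker_index, 0b101)

rand_idx, (rand_selected, rand_complement) = random_split(
    speaker_index, random.Random(0)
)
```

Because the index fully determines the split under this encoding, returning it alongside the two sets makes a random draw reproducible with the index-based variant.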
- draw_random_split()[source]
- Summary:
Applies self.draw_random_split_from_factor() to each factor independently, and returns all of the splits.
- set_random_seed(seed=0)[source]
- Summary:
Set the random seed of the random module
- Parameters
seed (int) – The seed for the random module. Defaults to 0.
- Return type
None
- nearby_splits(idxs, split)[source]
- Summary:
Make a generator over “nearby splits”. These are splits that are Hamming distance 1 away from the current split. By this we mean that if you concatenate the bit strings representing the indices into the powersets of values for each factor, a nearby split is any bit string that differs from the result in a single bit.
- Parameters
idxs – The indices into the powersets of the subsets corresponding to split
split (FactoredSplit) – a split (a factored split actually) around which we want to find splits that are Hamming distance = 1 away
- Returns
a generator over the neighboring splits
- Return type
Generator[FactoredSplit]
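The Hamming-distance-1 neighborhood can be illustrated on the indices alone. The sketch assumes each factor's index is a bitmask with one bit per value, and it yields neighboring index lists rather than FactoredSplit objects; the function name and the num_values argument are hypothetical.

```python
def nearby_indices(idxs, num_values):
    """Yield index lists that differ from idxs in exactly one bit.

    idxs[n] is the bitmask for factor n and num_values[n] is its bit width.
    """
    for n, (idx, width) in enumerate(zip(idxs, num_values)):
        for bit in range(width):
            neighbor = list(idxs)
            neighbor[n] = idx ^ (1 << bit)  # flip a single bit of factor n
            yield neighbor


# Two factors with 2 and 3 values: 2 + 3 = 5 neighbors in total.
neighbors = list(nearby_indices([0b01, 0b110], [2, 3]))
```

Each neighbor adds one value to, or removes one value from, the selected subset of exactly one factor.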
- get_neighborhood(idxs, split, l, max_neighbors=2000)[source]
- Summary:
Return a generator over all of the neighbors at distance l from split.
- Parameters
idxs (List[int]) – The list of indices, one per factor, into their respective powersets of values, corresponding to split
split (FactoredSplit) – The split whose neighbors at distance l we want to generate.
l (int) – The distance from split of the neighbors we would like to generate
max_neighbors (int) – The maximum number of neighbors to explore
- Returns
A generator over the neighbors at distance l from split
- Return type
Generator[FactoredSplit]
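Extending the single-bit case, neighbors at distance l can be enumerated by flipping every combination of l bit positions. The sketch below works on index bitmasks only and caps the output by truncating the enumeration at max_neighbors; the real method may cap or order its exploration differently, and the function name and num_values argument are assumptions.

```python
from itertools import combinations, islice


def neighborhood_indices(idxs, num_values, l, max_neighbors=2000):
    """Yield index lists at Hamming distance exactly l from idxs."""
    # Every (factor, bit) position in the concatenated bit string.
    positions = [(n, bit) for n, width in enumerate(num_values) for bit in range(width)]

    def generate():
        for flips in combinations(positions, l):
            neighbor = list(idxs)
            for n, bit in flips:
                neighbor[n] ^= 1 << bit
            yield neighbor

    return islice(generate(), max_neighbors)


def hamming(a, b):
    """Total number of differing bits between two index lists."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


start = [0b01, 0b110]
all_at_two = list(neighborhood_indices(start, [2, 3], l=2))
capped = list(neighborhood_indices(start, [2, 3], l=2, max_neighbors=3))
```

With B bits in total there are C(B, l) neighbors at distance l, which grows quickly and is the reason a cap such as max_neighbors is useful.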
- shake(idxs, split, k)[source]
- Summary:
Return a random split from the neighborhood around split.
- Parameters
idx – The index o
split (FactoredSplit) – The current split around which we will select a random neighbor
k (int) – The distance from split at which the new split, obtained by shaking, will be drawn. Kind of like a shake distance
- Returns
The randomly selected split from the neighborhood of split
- Return type
FactoredSplit
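A shake at distance k can be pictured as flipping k randomly chosen bit positions. This is only a sketch on index bitmasks under the same one-bit-per-value assumption; the actual method returns a FactoredSplit and may sample from an explicitly generated neighborhood instead. The function name and arguments are hypothetical.

```python
import random


def shake_indices(idxs, num_values, k, rng):
    """Return an index list at Hamming distance exactly k from idxs."""
    positions = [(n, bit) for n, width in enumerate(num_values) for bit in range(width)]
    neighbor = list(idxs)
    # Sampling without replacement guarantees k distinct bits are flipped.
    for n, bit in rng.sample(positions, k):
        neighbor[n] ^= 1 << bit
    return neighbor


current = [0b01, 0b110]
shaken = shake_indices(current, [2, 3], k=2, rng=random.Random(0))
distance = sum(bin(a ^ b).count("1") for a, b in zip(current, shaken))
```

Larger values of k move further from the current split, which is the usual way a shake step escapes a local optimum in a neighborhood search.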
- draw_random_node_cut()[source]
- Summary:
Draw random, non-adjacent vertices as source and target nodes, and compute the minimum st-vertex cut. This cut may result in more than 2 components. In this case, randomly assign the components to different splits.
- Returns
the split of components
- Return type
Split
- make_overlapping_test_sets(split)[source]
- Summary:
Takes a split of the dataset, i.e., two subsets of the dataset that do not overlap in the specified factors, and from the remaining data in the dataset not included in the split, creates multiple test sets that have some overlap with respect to one or more factors in the first of the two subsets in split.
In general, there are 2^N different kinds of overlap when using N factors. By overlap, we mean overlap in the factors that are considered under the similarity function used to create the graph. We can use the factor-specific graphs for this purpose.
- Parameters
split (Split) – The split (i.e., two subsets of the data sets) with no factor overlap with respect to which we are making the additional test sets.
- Returns
Test sets
- Return type
List[set]
- overlap_stats(s1, s2)[source]
- Summary:
Compute the overlap “stats” of s2 with respect to s1 for each factor.
- Parameters
s1 (set) – The set with respect to which overlap will be computed
s2 (set) – the set whose overlap is computed with respect to s1
- Returns
The dictionary of factor overlaps (s2 with respect to s1)
- Return type
dict
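The source does not say which statistics are computed, so the following is only one plausible reading: for each factor, the fraction of records in s2 that share at least one value with some record in s1. The records layout, the function name, and the choice of statistic are all assumptions for illustration.

```python
def overlap_fractions(records, s1, s2):
    """Per factor, the fraction of ids in s2 sharing a value with any id in s1."""
    num_factors = len(next(iter(records.values())))
    stats = {}
    for n in range(num_factors):
        # All values of factor n seen anywhere in s1.
        seen = set().union(*(records[i][n] for i in s1))
        overlapping = sum(1 for i in s2 if records[i][n] & seen)
        stats[n] = overlapping / len(s2) if s2 else 0.0
    return stats


# Hypothetical records: id -> one set of values per factor.
records = {
    "fid1": [{"spk_a"}, {"rec_1"}],
    "fid2": [{"spk_a"}, {"rec_2"}],
    "fid3": [{"spk_b"}, {"rec_2"}],
}
stats = overlap_fractions(records, s1={"fid1"}, s2={"fid2", "fid3"})
```

In this example half of s2 shares a first-factor value with s1 and none of it shares a second-factor value. The measure is not symmetric, which matches the "s2 with respect to s1" wording.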
- nachos.data.Data.collapse_factored_split(split)[source]
- Summary:
Take a FactoredSplit and collapse it by intersecting all of the selected sets, and intersecting all of their complements, to create a single selected set and a single other set with no overlap in any of the factors present in the selected set.
- Parameters
split (FactoredSplit) – The split to collapse
- Returns
the collapsed split
- Return type
Split
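The collapse described above amounts to two set intersections. The sketch assumes a factored split is a list of (selected, complement) pairs of record ids, one pair per factor, and that the collapsed split is a pair of sets; the real FactoredSplit and Split types may be structured differently.

```python
def collapse(factored_split):
    """Intersect the selected sets and, separately, their complements."""
    selected = set.intersection(*(set(s) for s, _ in factored_split))
    other = set.intersection(*(set(c) for _, c in factored_split))
    return selected, other


# Hypothetical per-factor splits over record ids 1..6.
factored = [
    ({1, 2, 3}, {4, 5, 6}),  # split induced by factor 0
    ({1, 2, 4}, {3, 5, 6}),  # split induced by factor 1
]
selected, other = collapse(factored)
```

Records 3 and 4 land on different sides for different factors, so they belong to neither collapsed set. That discarded data is the cost of requiring no overlap in every factor.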