dnarecords.reader
DNARecords available readers.
Module Contents
Classes
DNARecordsReader – DNARecords Tensorflow reader. Sample and variant wise.
DNASparkReader – DNARecords Spark reader. Sample and variant wise.
- class dnarecords.reader.DNARecordsReader(dnarecords_path, gzip=True)
DNARecords Tensorflow reader. Sample and variant wise.
Makes genomics data ML-ready for frameworks such as TensorFlow or PyTorch.
Consume the data variant-wise (common in GWAS analysis)
or sample-wise (Deep Learning models).
Tested on UKBB (UK Biobank).
Example
    import dnarecords as dr
    import tensorflow as tf

    # Root path of a dataset previously written with DNARecordsWriter.write
    output = '/tmp/dnarecords/output'
    reader = dr.reader.DNARecordsReader(output)

    # Sample-wise: one record per sample
    swds = reader.sample_wise_dataset()
    tf.print(next(iter(swds)))

    # Variant-wise: one record per variant
    vwds = reader.variant_wise_dataset()
    tf.print(next(iter(vwds)))
    {'key': 191,
     'chr1': 'SparseTensor(indices=[[0] [1] [2] ... [924] [925] [926]], values=[0.200760081 0.200760037 0.200760067 ... 0.0019912892 1.96934652 0.00396528561], shape=[909])',
     'chr10': 'SparseTensor(indices=[[124] [125] [126] ... [665] [666] [667]], values=[1.01560163 0.0306534301 1.99800873 ... 0.999999881 1.01956224 0.111815773], shape=[532])',
     ... }
    ...
    {'key': 3764,
     'tensor': 'SparseTensor(indices=[[0] [1] [2] ... [281] [282] [283]], values=[0.111815773 0.015601662 0.00788068399 ... 0.0593509413 0.000500936178], shape=[10880])'}
    ...
- Parameters
dnarecords_path – root path to your DNARecords created with DNARecordsWriter.write.
gzip – whether your tfrecords are gzipped or not. Default: True.
- metadata(self, vkeys_columns: List[str] = None, skeys_columns: List[str] = None, taste: bool = False) -> Dict[str, DataFrame]
Gets the metadata associated to the DNARecords dataset as a dictionary of names to pandas DataFrames.
- Return type
Dict[str, DataFrame].
- Returns
the metadata associated to the DNARecords as a dictionary of names to pandas DataFrames.
- Parameters
vkeys_columns – columns to return from variant metadata files (potentially big files). Defaults to None (all columns).
skeys_columns – columns to return from sample metadata files (potentially big files). Defaults to None (all columns).
taste – the full metadata DataFrames can be huge, so set this to True to get a small preview of them without running into memory issues. Use the preview to decide which columns to request metadata for. Defaults to False.
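The preview-then-select workflow described for taste can be sketched as below. This is a minimal illustration, not part of the library: metadata_columns is a hypothetical helper, and it assumes only the documented metadata signature and that each returned value exposes a pandas-style columns attribute.

```python
from typing import Dict, List


def metadata_columns(reader) -> Dict[str, List[str]]:
    """Return {metadata name: column names} from a cheap preview.

    `reader` is assumed to be a DNARecordsReader (or any object with the
    same `metadata` signature). Hypothetical helper, for illustration only.
    """
    # taste=True requests a small preview instead of the full DataFrames.
    preview = reader.metadata(taste=True)
    # Each value is expected to behave like a pandas DataFrame.
    return {name: list(frame.columns) for name, frame in preview.items()}
```

After inspecting the returned column names, the full metadata can be requested with only the columns of interest by passing them as vkeys_columns and skeys_columns to reader.metadata.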
- datafiles(self) -> Dict[str, List[str]]
Gets the paths of the DNARecords dataset files as a dictionary of names to List of paths.
- Return type
Dict[str, List[str]].
- Returns
the dnarecord files associated to the DNARecords as a dictionary of names to List of paths.
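Because datafiles returns a plain dictionary of names to lists of paths, a quick sanity check of what was written is straightforward. The sketch below is illustrative: count_datafiles is a hypothetical helper, and the names used as dictionary keys are whatever the reader returns, not something assumed here.

```python
from typing import Dict


def count_datafiles(reader) -> Dict[str, int]:
    """Return {name: number of files} for a DNARecords dataset.

    `reader` is assumed to expose `datafiles()` as documented above.
    Hypothetical helper, for illustration only.
    """
    files = reader.datafiles()
    # An empty list for a name means no files were found for it.
    return {name: len(paths) for name, paths in files.items()}
```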
- sample_wise_dataset(self, num_parallel_reads: int = -1, num_parallel_calls: int = -1, deterministic: bool = False, drop_remainder: bool = False, batch_size: int = None, buffer_size: int = None) -> Dataset
DNARecords Tensorflow reader in a sample wise fashion.
Specially intended for Deep Learning models.
- Returns
a Tensorflow dataset with the sample wise DNARecords genomics data.
- Parameters
num_parallel_reads – tf.data.TFRecordDataset equivalent parameter.
num_parallel_calls – tf.data.TFRecordDataset equivalent parameter.
deterministic – tf.data.TFRecordDataset equivalent parameter.
drop_remainder – tf.data.TFRecordDataset equivalent parameter.
batch_size – tf.data.TFRecordDataset equivalent parameter.
buffer_size – tf.data.TFRecordDataset equivalent parameter.
- Return type
tf.data.Dataset.
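As a sketch of how these parameters might be combined for model training, the helper below requests fixed-size batches in a stable order. training_dataset is a hypothetical name, and the comments assume the parameters behave like their tf.data namesakes; how the reader applies them internally is not shown here.

```python
def training_dataset(reader, batch_size: int):
    """Request a sample-wise dataset with fixed-size, ordered batches.

    `reader` is assumed to be a DNARecordsReader. Hypothetical helper,
    for illustration only.
    """
    return reader.sample_wise_dataset(
        batch_size=batch_size,   # number of samples per batch
        drop_remainder=True,     # assumed: drop a final, smaller batch
        deterministic=True,      # assumed: keep element order stable
    )
```

The remaining parameters are left at their documented defaults.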
- variant_wise_dataset(self, num_parallel_reads: int = -1, num_parallel_calls: int = -1, deterministic: bool = False, drop_remainder: bool = False, batch_size: int = None, buffer_size: int = None) -> Dataset
DNARecords Tensorflow reader in a variant wise fashion.
Specially intended for GWAS analysis.
- Parameters
num_parallel_reads – tf.data.TFRecordDataset equivalent parameter.
num_parallel_calls – tf.data.TFRecordDataset equivalent parameter.
deterministic – tf.data.TFRecordDataset equivalent parameter.
drop_remainder – tf.data.TFRecordDataset equivalent parameter.
batch_size – tf.data.TFRecordDataset equivalent parameter.
buffer_size – tf.data.TFRecordDataset equivalent parameter.
- Returns
a Tensorflow dataset with the variant wise DNARecords genomics data.
- Return type
tf.data.Dataset.
- class dnarecords.reader.DNASparkReader(dnarecords_path)
DNARecords Spark reader. Sample and variant wise.
Provides methods to read a previously created DNARecords dataset as pyspark DataFrame objects.
Review DNARecordsUtils.dnarecords_tree to know the structure of a previously created DNARecords dataset.
For each of these directories (when they exist, depending on the configuration used at DNARecordsWriter.write), a spark DataFrame can be returned.
Example
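    import dnarecords as dr

    # Root path of a dataset previously written with DNARecordsWriter.write
    output = '/tmp/dnarecords/output'
    reader = dr.reader.DNASparkReader(output)

    # Show the first two rows of each layout
    reader.sample_wise_dnarecords().show(2)
    reader.variant_wise_dnarecords().show(2)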
    +---+--------------------+--------------------+----------------+
    |key|        chr1_indices|         chr1_values|chr1_dense_shape| ...
    +---+--------------------+--------------------+----------------+
    | 26|[0, 2, 4, 5, 6, 7...|[0.33607214002352...|             909| ...
    | 29|[0, 1, 2, 3, 4, 5...|[0.20076008098505...|             909| ...
    +---+--------------------+--------------------+----------------+
    only showing top 1 row

    +--------------------+--------------------+----+-----------+
    |             indices|              values| key|dense_shape|
    +--------------------+--------------------+----+-----------+
    |[0, 1, 2, 3, 4, 5...|[0.9984177, 0.007...|3506|      10880|
    |[0, 1, 2, 3, 4, 5...|[0.11181577, 0.01...|3764|      10880|
    +--------------------+--------------------+----+-----------+
    only showing top 2 rows
    ...
- Return type
Dict[str, DataFrame]
- Parameters
dnarecords_path – path to the DNARecords root directory.
- Returns
A dictionary with DataFrame values corresponding to the generated outputs.
- metadata(self) -> Dict[str, DataFrame]
Gets the metadata associated to the DNARecords dataset as a dictionary of names to pyspark.sql DataFrames.
- Return type
Dict[str, DataFrame].
- Returns
the metadata associated to the DNARecords as a dictionary of names to pyspark.sql DataFrames.
- sample_wise_dnarecords(self) -> DataFrame
Gets a pyspark DataFrame from sample-wise DNARecords (when created as tfrecords).
- Return type
DataFrame.
- Returns
a pyspark DataFrame from sample-wise DNARecords.
- variant_wise_dnarecords(self) -> DataFrame
Gets a pyspark DataFrame from variant-wise DNARecords (when created as tfrecords).
- Return type
DataFrame.
- Returns
a pyspark DataFrame from variant-wise DNARecords.
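The variant-wise rows in the DNASparkReader example output carry indices, values and dense_shape columns, which together describe a sparse vector. The sketch below shows one way such a row could be expanded into a dense list. to_dense is a hypothetical helper, and it assumes that indices are zero-based positions and that positions not listed are zero, which this page does not confirm.

```python
from typing import List, Sequence


def to_dense(indices: Sequence[int], values: Sequence[float],
             dense_shape: int, fill: float = 0.0) -> List[float]:
    """Expand a sparse (indices, values, dense_shape) triple to a dense list.

    Assumption: positions absent from `indices` take the value `fill`.
    Hypothetical helper, for illustration only.
    """
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [fill] * dense_shape
    # Place each stored value at its position.
    for position, value in zip(indices, values):
        dense[position] = value
    return dense
```

With a row collected from the variant-wise DataFrame, the call would take the form to_dense(row['indices'], row['values'], row['dense_shape']).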