dnarecords.reader

Available DNARecords readers.

Module Contents

Classes

DNARecordsReader

DNARecords Tensorflow reader. Sample and variant wise.

DNASparkReader

DNARecords Spark reader. Sample and variant wise.

class dnarecords.reader.DNARecordsReader(dnarecords_path, gzip=True)[source]

DNARecords Tensorflow reader. Sample and variant wise.

ML-ready genomics data for frameworks such as TensorFlow or PyTorch.

  • Consume the data in a variant wise fashion (common GWAS analysis).

  • Or consume the data in a sample wise fashion (Deep Learning models).

  • Tested on UKBB.

Example

import dnarecords as dr
import tensorflow as tf

output = '/tmp/dnarecords/output'
reader = dr.reader.DNARecordsReader(output)

swds = reader.sample_wise_dataset()
tf.print(next(iter(swds)))

vwds = reader.variant_wise_dataset()
tf.print(next(iter(vwds)))
{'key': 191,
 'chr1': 'SparseTensor(indices=[[0] [1] [2] ... [924] [925] [926]],
                       values=[0.200760081 0.200760037 0.200760067 ... 0.0019912892 1.96934652 0.00396528561],
                       shape=[909])',
 'chr10': 'SparseTensor(indices=[[124] [125] [126] ... [665] [666] [667]],
                        values=[1.01560163 0.0306534301 1.99800873 ... 0.999999881 1.01956224 0.111815773],
                        shape=[532])',
  ... }
 ...

{'key': 3764,
 'tensor': 'SparseTensor(indices=[[0] [1] [2] ... [281] [282] [283]],
                         values=[0.111815773 0.015601662 0.00788068399 ... 0.0593509413 0.000500936178],
                         shape=[10880])'}
...
Parameters
  • dnarecords_path – root path to your DNARecords created with DNARecordsWriter.write

  • gzip – whether your tfrecords are gzipped or not. Default: True.

metadata(self, vkeys_columns: List[str] = None, skeys_columns: List[str] = None, taste: bool = False) Dict[str, DataFrame][source]

Gets the metadata associated with the DNARecords dataset as a dictionary mapping names to pandas DataFrames.

Return type

Dict[str, DataFrame].

Returns

the metadata associated with the DNARecords dataset, as a dictionary mapping names to pandas DataFrames.

Parameters
  • vkeys_columns – columns to return from variant metadata files (potentially big files). Defaults to None (all columns).

  • skeys_columns – columns to return from sample metadata files (potentially big files). Defaults to None (all columns).

  • taste – the full metadata DataFrames can be huge, so set this to True to get a small taste of them without running into memory issues. Use that preview to decide which columns to request metadata for. Defaults to False.
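The two-step use of taste described above can be sketched as follows. This is a minimal illustration, not part of the library: the helper name select_metadata is hypothetical, and the reader argument is only assumed to expose the metadata method with the signature documented here.

```python
from typing import Any, Dict, List, Optional


def select_metadata(reader: Any,
                    vkeys_columns: Optional[List[str]] = None,
                    skeys_columns: Optional[List[str]] = None) -> Dict[str, Any]:
    """Preview the metadata with taste=True, then load only the chosen columns.

    Hypothetical helper: `reader` is assumed to provide the `metadata`
    method documented above.
    """
    # Step 1: a cheap preview, used only to inspect the available columns.
    preview = reader.metadata(taste=True)
    for name, frame in preview.items():
        # Each value is expected to be a pandas DataFrame.
        print(name, list(frame.columns))

    # Step 2: the full load, restricted to the requested columns.
    return reader.metadata(vkeys_columns=vkeys_columns,
                           skeys_columns=skeys_columns)
```

In practice the column lists would be chosen after reading the preview, so the two steps are often run separately in an interactive session.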

datafiles(self) Dict[str, List[str]][source]

Gets the paths of the DNARecords dataset files as a dictionary of names to List of paths.

Return type

Dict[str, List[str]].

Returns

the files that make up the DNARecords dataset, as a dictionary mapping names to lists of paths.
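As a small sketch of how the returned dictionary might be inspected, the helper below counts the paths listed under each name. The function name count_datafiles is hypothetical, and the dictionary keys are whatever the reader returns; none are assumed here.

```python
from typing import Dict, List


def count_datafiles(datafiles: Dict[str, List[str]]) -> Dict[str, int]:
    """Return the number of file paths listed under each name."""
    return {name: len(paths) for name, paths in datafiles.items()}
```

With a reader instance, this would be called as count_datafiles(reader.datafiles()).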

sample_wise_dataset(self, num_parallel_reads: int = - 1, num_parallel_calls: int = - 1, deterministic: bool = False, drop_remainder: bool = False, batch_size: int = None, buffer_size: int = None) Dataset[source]

DNARecords Tensorflow reader in a sample wise fashion.

Intended mainly for Deep Learning models.

Returns

a Tensorflow dataset with the sample wise DNARecords genomics data.

Parameters
  • num_parallel_reads – tf.data.TFRecordDataset equivalent parameter.

  • num_parallel_calls – tf.data.TFRecordDataset equivalent parameter.

  • deterministic – tf.data.TFRecordDataset equivalent parameter.

  • drop_remainder – tf.data.TFRecordDataset equivalent parameter.

  • batch_size – tf.data.TFRecordDataset equivalent parameter.

  • buffer_size – tf.data.TFRecordDataset equivalent parameter.

Return type

tf.data.Dataset.
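The class example shows each record as sparse data: a list of indices, the matching values, and a shape. The pure-Python sketch below illustrates what that representation encodes by expanding it into a dense vector. The function name to_dense is hypothetical, and filling absent positions with 0.0 is an assumption about how missing entries should be read. For actual tf.SparseTensor objects, tf.sparse.to_dense performs the equivalent conversion.

```python
from typing import List, Sequence


def to_dense(indices: Sequence[int],
             values: Sequence[float],
             dense_shape: int,
             default: float = 0.0) -> List[float]:
    """Expand a one-dimensional sparse vector into a dense list.

    Positions not listed in `indices` receive `default` (assumed 0.0).
    """
    dense = [default] * dense_shape
    for index, value in zip(indices, values):
        if not 0 <= index < dense_shape:
            raise IndexError(f"index {index} outside dense shape {dense_shape}")
        dense[index] = value
    return dense
```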

variant_wise_dataset(self, num_parallel_reads: int = - 1, num_parallel_calls: int = - 1, deterministic: bool = False, drop_remainder: bool = False, batch_size: int = None, buffer_size: int = None) Dataset[source]

DNARecords Tensorflow reader in a variant wise fashion.

Intended mainly for GWAS analysis.

Parameters
  • num_parallel_reads – tf.data.TFRecordDataset equivalent parameter.

  • num_parallel_calls – tf.data.TFRecordDataset equivalent parameter.

  • deterministic – tf.data.TFRecordDataset equivalent parameter.

  • drop_remainder – tf.data.TFRecordDataset equivalent parameter.

  • batch_size – tf.data.TFRecordDataset equivalent parameter.

  • buffer_size – tf.data.TFRecordDataset equivalent parameter.

Returns

a Tensorflow dataset with the variant wise DNARecords genomics data.

Return type

tf.data.Dataset.

class dnarecords.reader.DNASparkReader(dnarecords_path)[source]

DNARecords Spark reader. Sample and variant wise.

Provides methods to read a previously created DNARecords dataset as pyspark DataFrame objects.

See DNARecordsUtils.dnarecords_tree for the directory structure of a previously created DNARecords dataset.

For each of these directories (when they exist, depending on the configuration used at DNARecordsWriter.write), a spark DataFrame can be returned.

Example

import dnarecords as dr

output = '/tmp/dnarecords/output'
reader = dr.reader.DNASparkReader(output)

reader.sample_wise_dnarecords().show(2)

reader.variant_wise_dnarecords().show(2)
+---+--------------------+--------------------+----------------+
|key|        chr1_indices|         chr1_values|chr1_dense_shape| ...
+---+--------------------+--------------------+----------------+
| 26|[0, 2, 4, 5, 6, 7...|[0.33607214002352...|             909| ...
| 29|[0, 1, 2, 3, 4, 5...|[0.20076008098505...|             909| ...
+---+--------------------+--------------------+----------------+
only showing top 2 rows

+--------------------+--------------------+----+-----------+
|             indices|              values| key|dense_shape|
+--------------------+--------------------+----+-----------+
|[0, 1, 2, 3, 4, 5...|[0.9984177, 0.007...|3506|      10880|
|[0, 1, 2, 3, 4, 5...|[0.11181577, 0.01...|3764|      10880|
+--------------------+--------------------+----+-----------+
only showing top 2 rows

...
Return type

Dict[str, DataFrame]

Parameters

dnarecords_path – path to the DNARecords root directory.

Returns

A dictionary with DataFrame values corresponding to the generated outputs.

metadata(self) Dict[str, DataFrame][source]

Gets the metadata associated with the DNARecords dataset as a dictionary mapping names to pyspark.sql DataFrames.

Return type

Dict[str, DataFrame].

Returns

the metadata associated with the DNARecords dataset, as a dictionary mapping names to pyspark.sql DataFrames.

sample_wise_dnarecords(self) DataFrame[source]

Gets a pyspark DataFrame from sample wise DNARecords (when created as tfrecords).

Return type

DataFrame.

Returns

a pyspark DataFrame from sample wise DNARecords.
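In the class example, the sample wise DataFrame has three columns per chromosome (chr1_indices, chr1_values, chr1_dense_shape). The sketch below groups column names by that prefix. It assumes the suffix convention visible in the example holds for every chromosome; the helper name columns_by_chromosome is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Suffixes taken from the column names shown in the example output.
SUFFIXES = ("_indices", "_values", "_dense_shape")


def columns_by_chromosome(columns: Iterable[str]) -> Dict[str, List[str]]:
    """Group column names such as 'chr1_values' under their prefix ('chr1')."""
    grouped = defaultdict(list)
    for column in columns:
        for suffix in SUFFIXES:
            if column.endswith(suffix):
                grouped[column[:-len(suffix)]].append(column)
                break  # a column matches at most one suffix
    return dict(grouped)
```

With a reader instance, it could be applied to reader.sample_wise_dnarecords().columns. Columns without one of the suffixes, such as key, are left out of the result.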

variant_wise_dnarecords(self) DataFrame[source]

Gets a pyspark DataFrame from variant wise DNARecords (when created as tfrecords).

Return type

DataFrame.

Returns

a pyspark DataFrame from variant wise DNARecords.
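Each variant wise row in the class example carries the columns indices, values, key and dense_shape. As an illustration of consuming one such row, the sketch below computes a simple per-variant summary from a plain dictionary. The helper name variant_summary is hypothetical, and treating positions absent from indices as 0 is an assumption.

```python
from typing import Any, Dict, Mapping


def variant_summary(row: Mapping[str, Any]) -> Dict[str, Any]:
    """Summarise one variant wise row given as a mapping of column to value.

    The mean is taken over all `dense_shape` positions, counting entries
    that are not stored as 0 (an assumption).
    """
    return {
        "key": row["key"],
        "stored": len(row["indices"]),  # number of explicitly stored entries
        "mean": sum(row["values"]) / row["dense_shape"],
    }
```

A pyspark Row can be converted with row.asDict() before being passed in, for example for the rows returned by reader.variant_wise_dnarecords().take(2).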

sample_wise_dnaparquet(self) DataFrame[source]

Gets a pyspark DataFrame from sample wise DNARecords (when created as parquet).

Return type

DataFrame.

Returns

a pyspark DataFrame from sample wise DNARecords.

variant_wise_dnaparquet(self) DataFrame[source]

Gets a pyspark DataFrame from variant wise DNARecords (when created as parquet).

Return type

DataFrame.

Returns

a pyspark DataFrame from variant wise DNARecords.
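The four read methods differ only in orientation (sample or variant) and storage format (tfrecords or parquet). A small dispatcher, sketched below, can select one by name. The function read_dnarecords and its orientation and storage arguments are hypothetical; only the four method names come from this page, and the behaviour when the requested output was never written is not covered here.

```python
from typing import Any


def read_dnarecords(reader: Any,
                    orientation: str = "sample",
                    storage: str = "records") -> Any:
    """Call one of the four documented read methods on `reader`.

    orientation: 'sample' or 'variant'.
    storage: 'records' (tfrecords) or 'parquet'.
    """
    if orientation not in ("sample", "variant"):
        raise ValueError(f"unknown orientation: {orientation!r}")
    if storage not in ("records", "parquet"):
        raise ValueError(f"unknown storage: {storage!r}")
    # Builds e.g. 'sample_wise_dnarecords' or 'variant_wise_dnaparquet'.
    method = getattr(reader, f"{orientation}_wise_dna{storage}")
    return method()
```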