dnarecords.writer

Available DNARecords writers.

Module Contents

Classes

DNARecordsWriter

Genomics data (VCF, BGEN, etc.) to TFRecords or Parquet, sample- and variant-wise.

class dnarecords.writer.DNARecordsWriter(expr: Expression, block_size: int = 1000000, staging: str = '/tmp/dnarecords/staging')[source]

Genomics data (VCF, BGEN, etc.) to TFRecords or Parquet, sample- and variant-wise.

Core class to go from genomics data to TFRecords or Parquet files ready to use with Deep Learning frameworks like TensorFlow or PyTorch.

  • Able to generate DNARecords variant-wise or sample-wise (i.e. transposing the matrix).

  • Takes advantage of sparsity (very convenient for saving space and computation, especially with Deep Learning).

  • Scales automatically to datasets of any size. Tested on UKBB.
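
The sparsity point can be made concrete with a toy example (plain Python, not part of the library API): for a rare variant, most dosages are exactly zero, so storing only the nonzero entries plus the dense shape preserves all the information at a fraction of the size.

```python
# Toy dosage vector for a rare variant: most samples are homozygous
# reference, so their dosage is exactly 0.0.
dense = [0.0] * 1000
dense[7] = 1.0    # one heterozygous sample
dense[123] = 2.0  # one homozygous-alternate sample

# Sparse form: nonzero (index, value) pairs plus the dense shape
# needed to reconstruct the original vector.
indices = [i for i, v in enumerate(dense) if v != 0.0]
values = [dense[i] for i in indices]
dense_shape = len(dense)

print(indices, values, dense_shape)  # [7, 123] [1.0, 2.0] 1000
```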

Example

import dnarecords as dr

hl = dr.helper.DNARecordsUtils.init_hail()
hl.utils.get_1kg('/tmp/1kg')
mt = hl.read_matrix_table('/tmp/1kg/1kg.mt')
mt = mt.annotate_entries(dosage=hl.pl_dosage(mt.PL))

output = '/tmp/dnarecords/output'
writer = dr.writer.DNARecordsWriter(mt.dosage)
writer.write(output, sparse=True, sample_wise=True, variant_wise=True,
             tfrecord_format=True, parquet_format=True,
             write_mode='overwrite', gzip=True)

reader = dr.reader.DNASparkReader(output)

reader.sample_wise_dnarecords().show(2)
+---+--------------------+--------------------+----------------+
|key|        chr1_indices|         chr1_values|chr1_dense_shape| ...
+---+--------------------+--------------------+----------------+
| 26|[0, 2, 4, 5, 6, 7...|[0.33607214002352...|             909| ...
| 29|[0, 1, 2, 3, 4, 5...|[0.20076008098505...|             909| ...
+---+--------------------+--------------------+----------------+
only showing top 2 rows

reader.variant_wise_dnarecords().show(2)
+--------------------+--------------------+----+-----------+
|             indices|              values| key|dense_shape|
+--------------------+--------------------+----+-----------+
|[0, 1, 2, 3, 4, 5...|[0.9984177, 0.007...|3506|      10880|
|[0, 1, 2, 3, 4, 5...|[0.11181577, 0.01...|3764|      10880|
+--------------------+--------------------+----+-----------+
only showing top 2 rows

...
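
A record from the sparse output above can be densified on the consumer side. A minimal plain-Python sketch (the `indices`/`values`/`dense_shape` names mirror the columns shown above; the literal numbers here are made up for illustration):

```python
# One sparse record: nonzero positions, their values, and the full length.
indices = [0, 1, 3]
values = [0.998, 0.011, 0.25]
dense_shape = 6

# Scatter the values back into a dense vector; positions that were
# filtered out by sparse=True are restored as zeros.
dense = [0.0] * dense_shape
for i, v in zip(indices, values):
    dense[i] = v

print(dense)  # [0.998, 0.011, 0.0, 0.25, 0.0, 0.0]
```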
Parameters
  • expr – a Hail expression. Currently, only expressions coercible to numeric are supported.

  • block_size – number of entries per block used in internal operations.

  • staging – path to a staging directory for intermediate data. Default: /tmp/dnarecords/staging.

write(self, output: str, sparse: bool = True, sample_wise: bool = True, variant_wise: bool = False, tfrecord_format: bool = True, parquet_format: bool = False, write_mode: str = 'error', gzip: bool = True) Dict[str, DataFrame][source]

DNARecords Spark writer.

Writes a DNARecords dataset based on the Hail expr provided in the class constructor.

Return type

Dict[str, DataFrame]

Parameters
  • output – path to the output location of the DNARecords.

  • sparse – generate sparse data (filtering out any zero values). Default: True.

  • sample_wise – generate DNARecords in a sample-wise fashion (i.e. transposing the matrix, one column -> one record). Default: True.

  • variant_wise – generate DNARecords in a variant-wise fashion (i.e. one row -> one record). Default: False.

  • tfrecord_format – generate tfrecords output files. Default: True.

  • parquet_format – generate parquet output files. Default: False.

  • write_mode – Spark write mode (‘error’, ‘overwrite’, etc.). Default: ‘error’.

  • gzip – gzip the output files. Default: True.

Returns

A dictionary with DataFrames for each generated output.