What is the best way to train a PyTorch model on a large dataset stored in a database?

Question

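I'm learning machine learning. I have a large dataset which, for convenience, I put into a SQLite database. It has about 270k rows, and each row holds a DNA sequence 10,000 bp long. Loading the whole dataset at once is impossible, let alone training a model on it (yes, I tried, but my laptop's GPU never started running or made any noise).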

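So at the moment I run a loop. In each iteration I select one page of data from the database using OFFSET and LIMIT (e.g. 500 sequences at a time) and train the model on those 500 sequences (for, say, 10 epochs).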

The relevant part of my code:

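import sqlite3

import pandas as pd
import torch
from torch.utils.data import DataLoader

my_offset = 0
my_limit = 500
my_batch_size = 100

some_model = MyModel()
conn = sqlite3.connect(db_path)  # db_path is a placeholder for the path to the sqlite file

db_page_number = 0

# pagination
while db_page_number < 100:
  # "table" is an SQL keyword, so it has to be quoted
  query = f'SELECT sequences, classification FROM "table" ORDER BY id LIMIT {my_limit} OFFSET {my_offset}'
  paged_data = pd.read_sql_query(query, conn)

  x, y = get_x_y_from(paged_data)
  train_dataset = torch.utils.data.TensorDataset(x, y)
  train_loader = DataLoader(train_dataset, batch_size=my_batch_size)
  for epoch in range(10):
    # ... typical pytorch code (loss_function and optimizer are set up elsewhere)...
    for x1, y1 in train_loader:
      optimizer.zero_grad()  # clear gradients left over from the previous step
      predicted_outputs = some_model(x1)  # predict output from the model
      train_loss = loss_function(predicted_outputs, y1)  # calculate loss for the predicted output
      train_loss.backward()  # back propagate the loss
      optimizer.step()  # adjust params based on the calculated gradients
      # ... typical pytorch code...

  # advance to the next page only after this one has been used,
  # otherwise the first page (offset 0) is skipped
  db_page_number += 1
  my_offset += my_limit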

(Let me know in the comments if you need more code.)

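Here I am doing the pagination by hand (the outermost while loop). I managed to get my code running, but I wonder what the best practice is. Perhaps using pandas or some other library for the pagination part? I'm open to suggestions.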

Answer 1

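One option is to have a "pagination_control" table from which the limit and offset can be obtained.

So it has two main columns (e.g. p_limit and p_offset).

This allows a flexible approach, because the limit and offset can be changed to suit. To move on to the next block/group/set of rows, update the pagination_control table after each selection has been extracted, adding the limit to the offset so that it is ready for the next block/group/set of sequences.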
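As a rough illustration of the idea above, here is a minimal Python sketch using the standard-library sqlite3 module. Only the pagination_control table with its p_limit and p_offset columns comes from this answer; the helper name fetch_next_block, the in-memory database, and the tiny 10-row table are assumptions made purely for illustration.

```python
import sqlite3


def fetch_next_block(conn):
    """Return the next block of rows, then advance p_offset by p_limit.

    Hypothetical helper: the name and return shape are not from the answer.
    """
    rows = conn.execute(
        "SELECT id, sequences, classification FROM `table` ORDER BY id "
        "LIMIT (SELECT p_limit FROM pagination_control) "
        "OFFSET (SELECT p_offset FROM pagination_control)"
    ).fetchall()
    # Ready the control table for the next block.
    conn.execute("UPDATE pagination_control SET p_offset = p_offset + p_limit")
    conn.commit()
    return rows


# Tiny stand-in for the real data (assumption: 10 rows, blocks of 4).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE `table` (id INTEGER PRIMARY KEY, sequences TEXT, classification TEXT)"
)
conn.executemany(
    "INSERT INTO `table` (sequences, classification) VALUES (?, ?)",
    [("ACGTA", f"CAT{i % 3}") for i in range(10)],
)
conn.execute("CREATE TABLE pagination_control (id INTEGER PRIMARY KEY, p_limit, p_offset)")
conn.execute("INSERT INTO pagination_control VALUES (1, 4, 0)")

first = fetch_next_block(conn)   # rows with id 1..4
second = fetch_next_block(conn)  # rows with id 5..8
```

Keeping the position in the database rather than in a Python variable means an interrupted run can pick up from the stored offset, at the cost of one extra UPDATE per block.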

Demo

Perhaps consider the following demo:

/* Cleanup demo environment just in case */
DROP TABLE IF EXISTS `table`;
DROP TABLE IF EXISTS pagination_control;
/* Create the core table */
CREATE TABLE IF NOT EXISTS `table` (id INTEGER PRIMARY KEY, sequences TEXT, classification TEXT, other_if_any TEXT DEFAULT 'ooops');
/* load the core table with some 300000 rows (after playing around) */
WITH 
    /* Create a CTE (temp table) with the core sequence identifiers */
    /* note very little knowledge of DNA */
    sequences(seq) AS (
        SELECT 'A' UNION ALL SELECT 'B' UNION ALL SELECT 'C' UNION ALL SELECT 'G' UNION ALL SELECT 'T'
    ),
    /* create a CTE with groups of identifiers (which could potentially compress the stored sequences (intended just as a hint)) */
    /* i.e. all permutations (3125) of 5 identifiers */
    grouped_sequences_by_5 AS (
    SELECT DISTINCT 
        groupof5.seq ||s2.seq||s3.seq||s4.seq||s5.seq AS groupof5
    FROM sequences AS groupof5 
        JOIN sequences AS s2 
        JOIN sequences AS s3 
        JOIN sequences AS s4 
        JOIN sequences AS s5
    )
    ,
    /* Create another CTE of random rows (300000 rows) */
    ready_to_insert(sequences,classification) AS (
        SELECT 
            (SELECT groupof5 FROM grouped_sequences_by_5 ORDER BY random() LIMIT 1 ),
            'CAT'||(abs(random()) % 10)
        UNION ALL SELECT 
            (SELECT groupof5 FROM grouped_sequences_by_5 ORDER BY random() LIMIT 1 ),
            'CAT'||(abs(random()) % 10)
        FROM ready_to_insert LIMIT 300000
    )
INSERT INTO `table` (sequences,classification) SELECT * FROM ready_to_insert;
SELECT * FROM `table`;

/*----------------------------------------*/
/* Now demonstrate the pagination_control */
CREATE TABLE IF NOT EXISTS pagination_control (id INTEGER PRIMARY KEY,p_limit,p_offset);
/* initialise pagination table */
INSERT OR REPLACE INTO pagination_control VALUES(1,1000,0);
SELECT * FROM `table` ORDER BY id LIMIT (SELECT p_limit FROM pagination_control) OFFSET (SELECT p_offset FROM pagination_control);
/* Always update after selection to ready for next block */
UPDATE pagination_control SET p_offset = p_offset + p_limit;
SELECT * FROM `table` ORDER BY id LIMIT (SELECT p_limit FROM pagination_control) OFFSET (SELECT p_offset FROM pagination_control);
/* Always update after selection to ready for next block (again) */
UPDATE pagination_control SET p_offset = p_offset + p_limit;
/* optional to alter e.g. set blocks to 500 rows per block */
UPDATE pagination_control SET p_limit = 500;
/* and so on */
SELECT * FROM `table` ORDER BY id LIMIT (SELECT p_limit FROM pagination_control) OFFSET (SELECT p_offset FROM pagination_control);
UPDATE pagination_control SET p_offset = p_offset + p_limit;
SELECT * FROM `table` ORDER BY id LIMIT (SELECT p_limit FROM pagination_control) OFFSET (SELECT p_offset FROM pagination_control);
UPDATE pagination_control SET p_offset = p_offset + p_limit;
/* Cleanup the demo environment */
DROP TABLE IF EXISTS pagination_control;
DROP TABLE IF EXISTS `table`;

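First, 300,000 rows are inserted (obviously just so that pagination_control can be demonstrated).
- You may wish to consider how grouped sequences could reduce the overall amount of data; e.g. above, every 5 bytes could be represented by 2 bytes (the id (rowid) of the grouped_sequences_by_5 CTE, if it were a permanent table).
- Result 1, the output of the first SELECT, lists the generated rows (the values are random, so they differ from run to run).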
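The "5 bytes into 2 bytes" hint can be sketched in Python as follows. This is only one possible reading of the hint: instead of looking up a rowid in a permanent grouped_sequences_by_5 table, each group of 5 characters is treated as a base-5 number (3125 possible values) and packed into an unsigned 16-bit integer. The function names encode and decode are hypothetical, and the alphabet is the demo's five identifiers, which is an assumption you would replace with the symbols that really occur in your data.

```python
import struct

ALPHABET = "ABCGT"  # the demo's five identifiers (assumption; swap in your own)
GROUP = 5           # 5 ** 5 == 3125 possible groups, which fits in 2 bytes


def encode(seq):
    """Pack each group of 5 characters into one little-endian uint16."""
    if len(seq) % GROUP:
        raise ValueError("sequence length must be a multiple of 5")
    codes = []
    for i in range(0, len(seq), GROUP):
        value = 0
        for ch in seq[i:i + GROUP]:
            # Treat the group as a base-5 number, most significant digit first.
            value = value * len(ALPHABET) + ALPHABET.index(ch)
        codes.append(value)
    return struct.pack(f"<{len(codes)}H", *codes)


def decode(blob):
    """Reverse encode(): unpack uint16 values back into 5-character groups."""
    out = []
    for (value,) in struct.iter_unpack("<H", blob):
        group = []
        for _ in range(GROUP):
            value, digit = divmod(value, len(ALPHABET))
            group.append(ALPHABET[digit])  # least significant digit comes out first
        out.append("".join(reversed(group)))
    return "".join(out)


packed = encode("ACGTAGGTCA")   # 10 characters become 4 bytes
restored = decode(packed)
```

With this scheme a 10,000-character sequence would take 4,000 bytes instead of 10,000, so each page read from the database is smaller, though the sequences have to be decoded before they are turned into tensors.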

It is only after the above, however, that the pagination_control table itself is demonstrated.

First it is created with the two core columns (p_limit and p_offset); the id column is only there to maintain a single row.

The SELECT that follows (Result 2) demonstrates how pagination_control is used to determine which rows are selected.

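The UPDATE that follows, which should normally come immediately after the SELECT, shows how the pagination table is made ready for the next SELECT (Result 3).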

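After the second SELECT, an UPDATE changes the limit from 1000 to 500. The third SELECT (Result 4) and the fourth SELECT (Result 5) then each grab the next block of 500 rows.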
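The sequence of SELECT and UPDATE statements in the demo can be traced with a few lines of plain Python. This is just the arithmetic, with no database involved; the variable names mirror the columns, and the list name windows is made up for the sketch.

```python
p_limit, p_offset = 1000, 0
windows = []  # (offset, limit) pairs used by Results 2 to 5

for select_number in range(1, 5):
    windows.append((p_offset, p_limit))  # what the SELECT would use
    p_offset += p_limit                  # the UPDATE that follows each SELECT
    if select_number == 2:
        p_limit = 500                    # the optional change to 500-row blocks

print(windows)
# [(0, 1000), (1000, 1000), (2000, 500), (2500, 500)]
```

With ORDER BY id and ids running from 1 without gaps, these windows correspond to ids 1 to 1000, 1001 to 2000, 2001 to 2500, and 2501 to 3000.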

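Of course, you can easily manipulate the pagination_control table and get very flexible control. For example, to reset, you could update it so that p_limit is 1000 and p_offset is 0, and then rerun with a blocking/paging factor of 1000.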
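A reset followed by a full pass over the table might look like the sketch below, again with the standard-library sqlite3 module. The helper names reset_pagination and iter_blocks, the BLOCK_QUERY constant, and the 2,500-row stand-in table are all assumptions for illustration, not part of the demo above.

```python
import sqlite3

BLOCK_QUERY = (
    "SELECT id, sequences, classification FROM `table` ORDER BY id "
    "LIMIT (SELECT p_limit FROM pagination_control) "
    "OFFSET (SELECT p_offset FROM pagination_control)"
)


def reset_pagination(conn, p_limit=1000, p_offset=0):
    """Put the single control row back to the start (hypothetical helper)."""
    conn.execute(
        "INSERT OR REPLACE INTO pagination_control VALUES (1, ?, ?)",
        (p_limit, p_offset),
    )
    conn.commit()


def iter_blocks(conn):
    """Yield successive blocks until the table is exhausted (hypothetical helper)."""
    while True:
        rows = conn.execute(BLOCK_QUERY).fetchall()
        if not rows:
            return
        conn.execute("UPDATE pagination_control SET p_offset = p_offset + p_limit")
        conn.commit()
        yield rows


# Stand-in data (assumption: 2,500 identical rows are enough to show the paging).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE `table` (id INTEGER PRIMARY KEY, sequences TEXT, classification TEXT)"
)
conn.executemany(
    "INSERT INTO `table` (sequences, classification) VALUES (?, ?)",
    [("ACGTA", "CAT0")] * 2500,
)
conn.execute("CREATE TABLE pagination_control (id INTEGER PRIMARY KEY, p_limit, p_offset)")

reset_pagination(conn)                       # paging factor 1000
first_pass = [len(block) for block in iter_blocks(conn)]

reset_pagination(conn, p_limit=500)          # rerun with 500-row blocks
second_pass = [len(block) for block in iter_blocks(conn)]
```

In a training script, each block yielded by such a generator could be converted to tensors and fed to the model, and calling the reset between passes would start the next pass over the data from the first row.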
