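I run the pgdatabase and pgadmin services in Docker on a single network (pg-network), and then ingest data into the database. The docker-compose and Dockerfile code and the ingestion script are given below.
The commands I ran are listed as well.
Files
docker-compose.yml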
services:
  pgdatabase:
    image: postgres:13
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=ny_taxi
    volumes:
      - "./ny_taxi_postgres_data:/var/lib/postgresql/data:rw"
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=[email protected]
      - PGADMIN_DEFAULT_PASSWORD=root
    volumes:
      - "./pgadmin_conn_data:/var/lib/pgadmin:rw"
    ports:
      - "8080:80"
Dockerfile
FROM python:3.9
RUN apt-get update && apt-get install -y wget
RUN pip install pandas==2.1.2 sqlalchemy==2.0.23 pyarrow==8.0.0 psycopg2==2.9.5 psycopg2-binary==2.9.5
WORKDIR /app
COPY ingest_data.py ingest_data.py
ENTRYPOINT [ "python", "ingest_data.py" ]
Command:
docker build -t test .
ingest.py
#!/usr/bin/env python
# coding: utf-8
import os
import argparse
from time import time

import pandas as pd
from sqlalchemy import create_engine


def ingest_data(user, password, host, port, db, table_name, csv_url):
    # the backup files are gzipped, and it's important to keep the correct extension
    # for pandas to be able to open the file
    if csv_url.endswith('.csv.gz'):
        csv_name = 'yellow_tripdata_2021-01.csv.gz'
    else:
        csv_name = 'output.csv'

    os.system(f"wget {csv_url} -O {csv_name}")

    postgres_url = f'postgresql://{user}:{password}@{host}:{port}/{db}'
    engine = create_engine(postgres_url)

    df_iter = pd.read_csv(csv_name, iterator=True, chunksize=100000)

    df = next(df_iter)
    df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)

    df.head(n=0).to_sql(name=table_name, con=engine, if_exists='replace')
    df.to_sql(name=table_name, con=engine, if_exists='append')

    while True:
        try:
            t_start = time()

            df = next(df_iter)
            df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
            df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)

            df.to_sql(name=table_name, con=engine, if_exists='append')

            t_end = time()
            print('inserted another chunk, took %.3f second' %
                  (t_end - t_start))
        except StopIteration:
            print("Finished ingesting data into the postgres database")
            break


if __name__ == '__main__':
    user = "root"
    password = "root"
    host = "pgdatabase"
    port = "5432"
    db = "ny_taxi"
    table_name = "yellow_taxi_trips"
    csv_url = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz"
    ingest_data(user, password, host, port, db, table_name, csv_url)
After running
docker-compose up
I go to http://localhost:8080/
Username: [email protected]
Password: root
and set up a server with the host name pgdatabase.
Then I run
python ingest.py
to ingest the data into the postgres database, but I get the error sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name "pgdatabase" to address: Unknown host
It may be validating the host name and expecting a "." for an IP address or a TLD. Could you use a full database connection string for your connection?
postgresql://root:root@pgdatabase:5432/ny_taxi
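For what it's worth, "could not translate host name" is the message libpq gives when the name lookup itself fails, so a quick check is whether the name resolves at all from the place where the script runs. A minimal sketch using only the standard library; `can_resolve` and its `resolver` parameter are names made up for this illustration, not part of psycopg2 or SQLAlchemy:

```python
import socket


def can_resolve(host, port=5432, resolver=socket.getaddrinfo):
    """Return True if `host` can be translated to an address, else False.

    `resolver` defaults to socket.getaddrinfo; it is only a parameter so the
    lookup can be swapped out (for example in a test).
    """
    try:
        resolver(host, port)
        return True
    except socket.gaierror:
        # Raised when the name lookup fails, i.e. the host name is unknown
        # from where this code is running.
        return False


# Example (hypothetical usage): call can_resolve("pgdatabase") from the same
# shell or container that runs the ingestion script.
```

Called from the host laptop, `can_resolve("pgdatabase")` would be expected to return False unless something else maps that name, while from a container on the same compose network it should return True.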
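If the python file isn't being run from one of the containers, then the host should probably be "127.0.0.1" or "localhost", because the docker-compose namespace isn't available to your host laptop, only to processes running inside the containers.
A last possibility is to add a container_name: attribute to each service, which has made a difference for me once or twice.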
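A minimal sketch of that idea, assuming the script reads the host from an environment variable so the same file works in both places. `PG_HOST` and `build_postgres_url` are hypothetical names chosen here, and the `localhost` default relies on the `5432:5432` port mapping in the docker-compose.yml from the question:

```python
import os


def build_postgres_url(user, password, db, port=5432, env=None):
    """Build the connection URL, taking the host from PG_HOST if it is set.

    PG_HOST is an assumed variable name. When it is missing, fall back to
    localhost, which is what a script started on the host laptop needs.
    """
    env = os.environ if env is None else env
    host = env.get("PG_HOST", "localhost")
    return f"postgresql://{user}:{password}@{host}:{port}/{db}"


# On the laptop (PG_HOST not set) this points at the published port.
laptop_url = build_postgres_url("root", "root", "ny_taxi", env={})

# Inside a container on the compose network, PG_HOST=pgdatabase would be set.
container_url = build_postgres_url(
    "root", "root", "ny_taxi", env={"PG_HOST": "pgdatabase"}
)
```

The resulting string could then be passed to `create_engine` in place of the hard-coded `postgres_url`.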