I have a table with the following structure (abbreviated):
class Valuation(Base):
    __tablename__ = 'valuation'
    id = Column(Integer, primary_key=True)
    reference = Column(BigInteger, index=True)
    value = Column(Float)
    period = Column(String)
Sample data:
reference | value | period |
---|---|---|
2433 | 110 | 2023-a |
5435 | 120 | 2023-b |
5435 | 110 | 2022-a |
2433 | 100 | 2022-b |
5435 | 105 | 2022-c |
2433 | 100 | 2021-a |
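A note on the data: `value` should decrease or stay the same over time, so the maximum value in any period should be no greater than the maximum of the earlier periods. I want to select each reference whose most recent value is greater than the latest value of any prior period.
For the data above, this would return:
reference | value | period |
---|---|---|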
2433 | 110 | 2023-a |
5435 | 120 | 2023-b |
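To make the rule concrete, here is a plain-Python sketch of the selection I'm after (not the SQLAlchemy query itself). It rests on assumptions that are mine: the first four characters of `period` are the year, the letter suffix orders rows within a year, and "greater than any prior period" means greater than the latest value of every earlier year. The helper name `flag_increases` is made up for illustration.

```python
from collections import defaultdict

# Sample rows from the table above: (reference, value, period).
rows = [
    (2433, 110, "2023-a"),
    (5435, 120, "2023-b"),
    (5435, 110, "2022-a"),
    (2433, 100, "2022-b"),
    (5435, 105, "2022-c"),
    (2433, 100, "2021-a"),
]


def flag_increases(rows):
    """Return the newest row of each reference whose value exceeds the
    latest value of every earlier year (assumed reading of the rule)."""
    # Keep only the latest row per (reference, year); periods sort lexically.
    latest = {}
    for reference, value, period in rows:
        key = (reference, period[:4])  # assumption: year is the first 4 chars
        if key not in latest or period > latest[key][2]:
            latest[key] = (reference, value, period)

    by_reference = defaultdict(list)
    for row in latest.values():
        by_reference[row[0]].append(row)

    flagged = []
    for reference in sorted(by_reference):
        history = sorted(by_reference[reference], key=lambda r: r[2])
        *earlier, newest = history
        # A reference with no earlier year has nothing to compare against.
        if earlier and all(newest[1] > row[1] for row in earlier):
            flagged.append(newest)
    return flagged


print(flag_increases(rows))
# [(2433, 110, '2023-a'), (5435, 120, '2023-b')]
```

Swapping `all` for `any` would give the looser reading, where exceeding the latest value of just one earlier year is enough; both readings agree on this sample.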
Reviewing this suggests that the `aliased` approach would help, but I'm a bit lost on how best to structure it.
Where I've got to so far:
value2022 = aliased(Valuation, name="value2022")
value2021 = aliased(Valuation, name="value2021")
query = (
    db.query(Valuation)
    .outerjoin(value2022, (
        (Valuation.reference == value2022.reference)
        & (Valuation.value > value2022.value)
        & (Valuation.period.startswith("2023"))
        & (value2022.period.startswith("2022"))
        )
    )
    .outerjoin(value2021, (
        (Valuation.reference == value2021.reference)
        & (Valuation.value > value2021.value)
        & (Valuation.period.startswith("2023"))
        & (value2021.period.startswith("2021"))
        )
    )
    .order_by(
        Valuation.reference,
        Valuation.period.desc(),
    )
    .distinct(Valuation.reference)
    .all()
)
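However, this doesn't compare the latest period's value against the latest value of each prior period, and it seems badly over-fitted to the sample data. Is there a way to do this?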
I haven't used `sqlalchemy` in a long time, but maybe this will be useful.
from sqlalchemy import create_engine, Column, Integer, BigInteger, Float, String, text
from sqlalchemy.orm import declarative_base, sessionmaker
import pandas as pd

# In-memory SQLite database, enough for a self-contained demo
engine = create_engine('sqlite://')
Base = declarative_base()
Session = sessionmaker(bind=engine)
session = Session()


class Valuation(Base):
    __tablename__ = 'valuation'
    id = Column(Integer, primary_key=True)
    reference = Column(BigInteger, index=True)
    value = Column(Float)
    period = Column(String)


Base.metadata.create_all(engine)

# Load the sample data from the question
for valuation in [
    Valuation(reference=2433, value=110, period='2023-a'),
    Valuation(reference=5435, value=120, period='2023-b'),
    Valuation(reference=5435, value=110, period='2022-a'),
    Valuation(reference=2433, value=100, period='2022-b'),
    Valuation(reference=5435, value=105, period='2022-c'),
    Valuation(reference=2433, value=100, period='2021-a'),
]:
    session.add(valuation)
session.commit()

with engine.connect() as con:
    # Find the max value per reference, then join back to recover its period
    cursor = con.execute(text("""
        -- max by reference
        WITH r AS (
            SELECT reference,
                   max(value) AS value
            FROM valuation
            GROUP BY reference
        )
        SELECT r.*, p.period
        FROM r
        JOIN (
            SELECT reference,
                   period,
                   value
            FROM valuation
        ) AS p ON (p.reference = r.reference AND p.value = r.value)
    """))
    print('result using sql:')
    print(cursor.all())

    # also you can use pandas
    df = pd.read_sql_query("""
        SELECT reference,
               period,
               max(value) AS value
        FROM valuation
        GROUP BY reference, period
        ORDER BY value DESC
    """, con=con.connection)
    print('\ndataframe from database\n')
    print(df)
    print('\ndataframe after deduplication\n')
    # Rows are sorted by value, so the first row kept per reference is its max
    print(df.drop_duplicates(['reference']))
Let's run it:
result using sql:
[(2433, 110.0, '2023-a'), (5435, 120.0, '2023-b')]

dataframe from database

   reference  period  value
0       5435  2023-b  120.0
1       2433  2023-a  110.0
2       5435  2022-a  110.0
3       5435  2022-c  105.0
4       2433  2021-a  100.0
5       2433  2022-b  100.0

dataframe after deduplication

   reference  period  value
0       5435  2023-b  120.0
1       2433  2023-a  110.0

Process finished with exit code 0
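The query above picks the row with the highest value per reference, which happens to match the expected output for this sample. If you need the stricter comparison from the question (the newest row against the latest row of each earlier year), something along these lines might work. It is only a sketch under assumptions of mine: the first four characters of `period` are the year, the suffix sorts alphabetically within a year, and a reference with a single year is never flagged. It uses the standard-library `sqlite3` module rather than the SQLAlchemy engine so that it runs on its own; the CTE names `latest_per_year` and `newest` and the helper `find_increases` are made up for illustration.

```python
import sqlite3

QUERY = """
WITH latest_per_year AS (
    -- latest row of each (reference, year); year = first 4 chars of period
    SELECT v.reference, v.value, v.period, substr(v.period, 1, 4) AS year
    FROM valuation AS v
    WHERE v.period = (
        SELECT max(p.period)
        FROM valuation AS p
        WHERE p.reference = v.reference
          AND substr(p.period, 1, 4) = substr(v.period, 1, 4)
    )
),
newest AS (
    -- the most recent year's row for each reference
    SELECT l.reference, l.value, l.period, l.year
    FROM latest_per_year AS l
    WHERE l.year = (
        SELECT max(m.year)
        FROM latest_per_year AS m
        WHERE m.reference = l.reference
    )
)
SELECT n.reference, n.value, n.period
FROM newest AS n
WHERE EXISTS (          -- there is at least one earlier year to compare with
        SELECT 1 FROM latest_per_year AS e
        WHERE e.reference = n.reference AND e.year < n.year
    )
  AND NOT EXISTS (      -- and no earlier year's latest value reaches the newest
        SELECT 1 FROM latest_per_year AS e
        WHERE e.reference = n.reference
          AND e.year < n.year
          AND e.value >= n.value
    )
ORDER BY n.reference
"""

SAMPLE = [
    (2433, 110, "2023-a"),
    (5435, 120, "2023-b"),
    (5435, 110, "2022-a"),
    (2433, 100, "2022-b"),
    (5435, 105, "2022-c"),
    (2433, 100, "2021-a"),
]


def find_increases(rows):
    """Load (reference, value, period) rows into SQLite and run QUERY."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE valuation ("
        "id INTEGER PRIMARY KEY, reference BIGINT, value FLOAT, period TEXT)"
    )
    con.executemany(
        "INSERT INTO valuation (reference, value, period) VALUES (?, ?, ?)",
        rows,
    )
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


print(find_increases(SAMPLE))
```

The same SQL string could be wrapped in `text()` and run through `con.execute` as in the script above, since it contains no bind parameters.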