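I can access an HTTP URL and retrieve its directory listing. I then go through the listing line by line, check whether each link has a .txt extension, fetch it with requests (reading the response's content), and decode the text file.
However, I would like to be able to filter the list by date. The directory listing looks like this: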
<HTML><HEAD><TITLE>IP Address/Log</TITLE>
<BODY>
<H1>Log</H1><HR>
<PRE><A HREF="/Main/">[To Parent Directory]</A>
23/11/18 19:07 314 <A HREF="/Log/Alarm_231118.txt">Alarm_231118.txt</A>
23/11/16 23:59 150516 <A HREF="/Log/Temperature%20Detail_Data%20Log_231116.txt">Temperature Detail_Data Log_231116.txt</A>
23/11/28 15:22 450 <A HREF="/Log/Alarm_231128.txt">Alarm_231128.txt</A>
23/11/17 0:00 450536 <A HREF="/Log/Temperature%20Detail_Data%20Log.log">Temperature Detail_Data Log.log</A>
23/11/16 23:59 110148 <A HREF="/Log/Water%20Temp%20Trend_Data%20Log_231116.txt">Water Temp Trend_Data Log_231116.txt</A>
</PRE><HR></BODY></HTML>
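I am only interested in the lines that contain a date, time, size, and HREF link. I want to create a dataframe where the first column is the date, the second is the time, the third is the size, and the fourth is the link. To access each link, I use the following code: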
from io import StringIO

import pandas as pd
import requests

# lines, file_prefix_all, url_root, auth and files_dataframes are defined earlier in my script
for line in lines:
    if ".txt" in line:
        # The HREF value is the first double-quoted string on the line
        filename = line.split('"')[1]
        if filename.startswith(file_prefix_all) and filename.endswith(".txt"):
            file_url = url_root + filename
            print(file_url)
            file_response = requests.get(file_url, auth=auth)
            if file_response.status_code == 200:
                # Read the tab-separated content into a Pandas DataFrame
                file_content = file_response.content.decode('utf-8')
                df = pd.read_csv(StringIO(file_content), sep='\t')
                files_dataframes.append(df)
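Once the listing is in a dataframe, can I use the same approach? Any help or advice would be greatly appreciated!
You can use a regular expression to parse the text, for example: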
import re
import pandas as pd
text = """\
<HTML><HEAD><TITLE>IP Address/Log</TITLE>
<BODY>
<H1>Log</H1><HR>
<PRE><A HREF="/Main/">[To Parent Directory]</A>
23/11/18 19:07 314 <A HREF="/Log/Alarm_231118.txt">Alarm_231118.txt</A>
23/11/16 23:59 150516 <A HREF="/Log/Temperature%20Detail_Data%20Log_231116.txt">Temperature Detail_Data Log_231116.txt</A>
23/11/28 15:22 450 <A HREF="/Log/Alarm_231128.txt">Alarm_231128.txt</A>
23/11/17 0:00 450536 <A HREF="/Log/Temperature%20Detail_Data%20Log.log">Temperature Detail_Data Log.log</A>
23/11/16 23:59 110148 <A HREF="/Log/Water%20Temp%20Trend_Data%20Log_231116.txt">Water Temp Trend_Data Log_231116.txt</A>
</PRE><HR></BODY></HTML>"""
# Each match becomes one row; the named groups become the columns
df = pd.DataFrame(
    map(
        re.Match.groupdict,
        re.finditer(
            r'(?P<date>\d+/\d+/\d+).*?(?P<time>\d+:\d+).*?(?P<size>\d+).*?HREF="(?P<filename>[^"]+)"',
            text,
        ),
    )
)
print(df)
Prints:
date time size filename
0 23/11/18 19:07 314 /Log/Alarm_231118.txt
1 23/11/16 23:59 150516 /Log/Temperature%20Detail_Data%20Log_231116.txt
2 23/11/28 15:22 450 /Log/Alarm_231128.txt
3 23/11/17 0:00 450536 /Log/Temperature%20Detail_Data%20Log.log
4 23/11/16 23:59 110148 /Log/Water%20Temp%20Trend_Data%20Log_231116.txt
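Every column produced by the regex is a string. If typed values are needed, one possible sketch is to build a converted copy. This assumes the listing writes dates as yy/mm/dd and times as 24-hour h:mm, which the sample suggests but does not confirm; the names typed, timestamp and name are only illustrative. The two-row frame stands in for the df built above.

```python
import pandas as pd
from urllib.parse import unquote

# Stand-in for the regex result: two rows with the same string columns.
df = pd.DataFrame(
    {
        "date": ["23/11/18", "23/11/17"],
        "time": ["19:07", "0:00"],
        "size": ["314", "450536"],
        "filename": [
            "/Log/Alarm_231118.txt",
            "/Log/Temperature%20Detail_Data%20Log.log",
        ],
    }
)

# Build a converted copy; df itself is left unchanged.
typed = df.assign(
    # Assumption: yy/mm/dd dates and 24-hour h:mm times.
    timestamp=pd.to_datetime(df["date"] + " " + df["time"], format="%y/%m/%d %H:%M"),
    # File size in bytes as an integer instead of a string.
    size=df["size"].astype(int),
    # Human-readable name with %20 and similar escapes decoded.
    name=df["filename"].map(unquote),
)
print(typed.dtypes)
```

The original filename column is kept as is, since the percent-encoded path is what should be appended to the server address when requesting the file.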
Then filter the dataframe:
print(df[df.filename.str.endswith(".txt")])
Prints:
date time size filename
0 23/11/18 19:07 314 /Log/Alarm_231118.txt
1 23/11/16 23:59 150516 /Log/Temperature%20Detail_Data%20Log_231116.txt
2 23/11/28 15:22 450 /Log/Alarm_231128.txt
4 23/11/16 23:59 110148 /Log/Water%20Temp%20Trend_Data%20Log_231116.txt
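The same idea extends to the date filter asked about in the question. The following is a minimal sketch, assuming the dates are in yy/mm/dd order; select_files is a hypothetical helper and url_root is a placeholder address, neither comes from the original post. The frame is rebuilt here so the snippet runs on its own.

```python
import pandas as pd
from urllib.parse import urljoin

# Same rows as the parsed listing above, as string columns.
df = pd.DataFrame(
    {
        "date": ["23/11/18", "23/11/16", "23/11/28", "23/11/17", "23/11/16"],
        "time": ["19:07", "23:59", "15:22", "0:00", "23:59"],
        "size": ["314", "150516", "450", "450536", "110148"],
        "filename": [
            "/Log/Alarm_231118.txt",
            "/Log/Temperature%20Detail_Data%20Log_231116.txt",
            "/Log/Alarm_231128.txt",
            "/Log/Temperature%20Detail_Data%20Log.log",
            "/Log/Water%20Temp%20Trend_Data%20Log_231116.txt",
        ],
    }
)

# Assumption: the listing uses yy/mm/dd; convert so dates compare correctly.
df["date"] = pd.to_datetime(df["date"], format="%y/%m/%d")


def select_files(frame, start, end, suffix=".txt"):
    """Return rows dated within [start, end] (inclusive) whose link ends with suffix."""
    in_range = frame["date"].between(pd.Timestamp(start), pd.Timestamp(end))
    has_suffix = frame["filename"].str.endswith(suffix)
    return frame[in_range & has_suffix]


selected = select_files(df, "2023-11-17", "2023-11-30")

# Hypothetical server address; replace with the real url_root.
url_root = "http://192.0.2.10"
urls = [urljoin(url_root, path) for path in selected["filename"]]
print(urls)
```

Each entry in urls can then be passed to requests.get(..., auth=auth) exactly as in the loop from the question.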