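I am trying to load multiple log files that share a similar format, but a single record can span several lines. I have written some code, but reading and appending each record one at a time takes too long. My sample code is below.
Please help.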
    import gzip
    import pandas as pd

    df = pd.DataFrame({'eventtime': [],
                       'FileType': [],
                       'msg_type': [],
                       'thred_num': [],
                       'msg_lyr': [],
                       'message': [],
                       'colorcode': []
                       })

    def write_line(record, file_type):
        if record == '':
            return
        # Split on "[" to separate the header, the thread/layer part and the message
        split1 = record.split("[")
        split2 = split1[0].split(" ")
        split3 = split1[1].split(" ")
        split4 = split1[2]
        # Turn "08:53:26,261" into "08:53:26.261"
        s_time = split2[1].split(",")
        str_dateime = split2[0] + ' ' + s_time[0] + "." + s_time[1]
        # One row is appended to the DataFrame per record
        df.loc[len(df)] = pd.Series({'eventtime': str_dateime,
                                     'FileType': file_type,
                                     'msg_type': split2[2],
                                     'thred_num': split3[0][:-1],
                                     'msg_lyr': split3[1],
                                     'message': split4,
                                     'colorcode': ""
                                     })

    # si_Files is the list of gzipped log file paths (defined elsewhere)
    for si_file in si_Files:
        f = gzip.open(si_file, 'rt')
        file_content = f.read()
        f.close()
        for record in file_content.split("]~~\n"):
            write_line(record, 'silogs')
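Is there a better way to read the files and load the data into a DataFrame than adding rows one at a time? The current approach is too resource-intensive and wastes resources on the application server that runs the main application.
A sample log file is shown below: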
2019-12-02 08:53:26,261 INFO [18] CTL.CONF - [Loading Configurations]~~
2019-12-02 08:53:26,273 DEBUG [18] trg.sm.cs.client.CfgInterface - [Start: |User:default|ClientMachine:xxx.xxx.xxx.xxx]~~
2019-12-02 08:53:26,274 DEBUG [18] trg.sm.cs.client.CfgInterface - [Start: Waiting for connection with configuration server: xxx.xxx.xxx.xxx:X000]~~
2019-12-02 08:53:26,328 INFO [19] GSI.Comms.SC - [Connecting|xxx.xxx.xx.xxx:x000]~~
2019-12-02 08:53:26,329 WARN [19] GSI.Comms.SC - [Fast Loopback enabled]~~
2019-12-02 08:53:27,334 ERROR [19] GSI.Comms.SC - [Failed to connect with
server|Endpoint:xxx.xxx.xx.xxx:x000|Error:No connection could be made because the target
machine actively refused it xxx.xxx.xx.xxx:x000(ConnectionRefused:10061)]~~
2019-12-02 08:53:30,340 INFO [19] GSI.Comms.SC - [Connecting|xxx.xxx.xx.xxx:x000]~~
2019-12-02 08:53:30,341 WARN [19] GSI.Comms.SC - [Fast Loopback enabled]~~
2019-12-02 08:53:30,393 WARN [19] sm.cs.client.CfgInterface - [Start: No QueryReload performed, EngineId missing.]~~
2019-12-02 08:53:30,393 DEBUG [19] sm.cs.client.CfgInterface - [ClientConnection | Connected to the Server : Primary]~~
2019-12-02 08:53:30,393 DEBUG [18] sm.cs.client.CfgInterface - [Start: done.]~~
2019-12-02 08:53:30,512 DEBUG [13] CTL.CONF - [ReloadResponse: Submitted]~~
2019-12-02 08:53:31,791 INFO [18] CTL.CONF - [GroupSettingsXml|<Groups>
xml tages EventProcessingLagThreshold 100 /EventProcessingLagThreshold
XML tages QueueLengthThreshold 100 /QueueLengthThreshold
xml tags/Groups]~~
2019-12-02 08:53:31,803 INFO [18] CTL.CONF - [EventProcessingLagThreshold:100]~~
2019-12-02 08:53:32,122 INFO [18] SM.ENG.SatmapEngineCommonLibrary.Emailingclient.EmailAlerts - [Alerts Initialized with SMTPserver: 10.80.10.141, SMTPport: , Recipients: ssadaasd, Sender Address: noreply-aasdlabaasd]~~
2019-12-02 08:53:41,856 INFO [35] PRO.0.Q-AS-R - [QueryStatusResponse|ASC|AID:70032|DN:20029|CS:client_MODE_UNKNOWN=>client_MODE_READY|TS:TALK_STATE_UNKNOWN=>TALK_STATE_AVAILABLE]~~
2019-12-02 08:53:41,863 INFO [31] CTL - [HandleclientStatusEvent(client_MODE_READY)|AID:70032|KeepFreeclientOrder:False|ER:SUCCESSFULL]~~
2019-12-02 08:53:41,871 DEBUG [27] GSI.Comms.CM - [Tx|0|clientFree|MsgId:22|D-CID:1]~~
2019-12-02 08:53:41,899 DEBUG [24] TSP.EF - [Rx|QUERY_TOD_RESPONSE|INV-ID:20|TOD:12/02/2019 11:53:41]~~
2019-12-02 08:53:41,899 DEBUG [35] PRO.0 - [Pr|QUERY_TOD_RESPONSE]~~
2019-12-02 08:53:41,899 DEBUG [35] PRO.0 - [RR|QueryTodRequest|INV-ID:20|08:53:41 808,08:53:41 817,08:53:41 899|RTT:90.98ms|PT:8.97]~~
2019-12-02 08:53:41,899 INFO [35] PRO.0 - [CR|QueryTodRequest(ToCheckIfSkillMonitorCompleted)|S:80029|Pass:3]~~
2019-12-02 08:53:41,899 DEBUG [24] TSP.EF - [Rx|QUERY_TOD_RESPONSE|INV-ID:21|TOD:12/02/2019 11:53:41]~~
2019-12-02 08:53:41,899 DEBUG [35] PRO.0 - [Pr|QUERY_TOD_RESPONSE]~~
2019-12-02 08:53:41,899 DEBUG [35] PRO.0 - [RR|QueryTodRequest|INV-ID:21|08:53:41 814,08:53:41 820,08:53:41 899|RTT:85.01ms|PT:6]~~
2019-12-02 08:53:41,899 INFO [36] TSP.0.1 - [Tx|QueryTime]~~
2019-12-02 08:53:41,900 INFO [36] PRO.0.RM - [RP|QueryTodRequest|INV-ID:22|ICheckRequest:False|SkillId:80029(Pass:3)]~~
2019-12-02 08:53:41,903 INFO [36] PRO.0 - [ST|OR Count:0|PR Count:1]~~
2019-12-02 08:53:41,950 DEBUG [24] TSP.EF - [Rx|QUERY_TOD_RESPONSE|INV-ID:22|TOD:12/02/2019 11:53:41]~~
2019-12-02 08:53:41,950 DEBUG [35] PRO.0 - [Pr|QUERY_TOD_RESPONSE]~~
2019-12-02 08:53:41,950 DEBUG [35] PRO.0 - [RR|QueryTodRequest|INV-ID:22|08:53:41 899,08:53:41 899,08:53:41 950|RTT:51ms|PT:]~~
2019-12-02 08:53:41,950 INFO [35] PRO.0.Q-STOD - [SkillMonitored, Marked for Route Register|SK:80029]~~
2019-12-02 08:53:41,952 INFO [35] PRO.0.SMGR.1 - [RegisterSkill|SK:80029]~~
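    # Generator that yields one complete log record per "~~" terminator
    def myReader(file_name):
        outRow = ''
        with open(file_name, 'r') as f:
            for row in f:
                row = row.strip()
                if row[-2:] == '~~':
                    # Terminator found: emit the accumulated record without "~~"
                    outRow += row[0:-2]
                    yield outRow
                    outRow = ''
                else:
                    # Continuation line: keep accumulating
                    outRow += row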
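It joins each "continuation" row (one that does not end with "~~") to the rows before it and yields the complete record once the terminator is reached. Then, assuming you have just one input file, load your DataFrame by running: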
    import re
    import pandas as pd

    df = pd.DataFrame(columns=['eventtime', 'FileType', 'msg_type',
                               'thread_num', 'msg_lyr', 'message', 'colorcode'])
    # msg_lyr also accepts '-' so that names such as PRO.0.Q-AS-R match
    pat = re.compile(r'(?P<Date>[\d-]+ [\d:]+,\d+) (?P<msg_type>\w+) +'
                     r'\[(?P<thread_num>\d+)\] (?P<msg_lyr>[\w.-]+) +\- \[(?P<message>[^\]]+)')
    myGen = myReader('log.txt')
    for row in myGen:
        mtch = pat.match(row)
        if mtch:
            # Convert "08:53:26,261" to "08:53:26.261"
            dat = mtch.group('Date').replace(',', '.')
            df.loc[len(df)] = pd.Series({'eventtime': dat,
                                         'FileType': 'silogs',
                                         'msg_type': mtch.group('msg_type'),
                                         'thread_num': mtch.group('thread_num'),
                                         'msg_lyr': mtch.group('msg_lyr'),
                                         'message': mtch.group('message'),
                                         'colorcode': ''})
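Since the question is about speed and about several gzipped files, one possible variation is to collect the parsed records in a plain Python list and build the DataFrame once at the end, rather than assigning to `df.loc[len(df)]` for every record. The sketch below is only an illustration under some assumptions: the helper names `read_records`, `parse_records` and `load_logs` are hypothetical, the files are assumed to be gzip-compressed text as in the question, and the layer-name character class is widened to accept `-` because names such as `PRO.0.Q-AS-R` appear in the sample log.

```python
import gzip
import re

import pandas as pd

COLUMNS = ['eventtime', 'FileType', 'msg_type',
           'thread_num', 'msg_lyr', 'message', 'colorcode']

# Same record layout as the answer's pattern; '-' is allowed in msg_lyr (assumption).
PAT = re.compile(
    r'(?P<eventtime>[\d-]+ [\d:]+,\d+) (?P<msg_type>\w+) +'
    r'\[(?P<thread_num>\d+)\] (?P<msg_lyr>[\w.-]+) +- \[(?P<message>[^\]]+)'
)


def read_records(lines):
    """Yield complete records, joining continuation lines until '~~' is seen."""
    buf = ''
    for line in lines:
        line = line.strip()
        if line.endswith('~~'):
            yield buf + line[:-2]
            buf = ''
        else:
            buf += line


def parse_records(records, file_type):
    """Yield one dict per record that matches PAT; other records are skipped."""
    for rec in records:
        m = PAT.match(rec)
        if m:
            row = m.groupdict()
            row['eventtime'] = row['eventtime'].replace(',', '.')
            row['FileType'] = file_type
            row['colorcode'] = ''
            yield row


def load_logs(paths, file_type='silogs'):
    """Parse every gzipped file in paths and build the DataFrame in one step."""
    rows = []
    for path in paths:
        # Iterating over the file object reads it line by line
        with gzip.open(path, 'rt') as f:
            rows.extend(parse_records(read_records(f), file_type))
    return pd.DataFrame(rows, columns=COLUMNS)
```

With the question's list of paths this would be called as `df = load_logs(si_Files)`. The design choice here is that the list of dicts grows cheaply and pandas allocates the frame only once. Whether this is fast enough for your files is something to measure on real data.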