How to load multi-line records into a DataFrame

Problem description

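I am trying to load several log files that share a similar format, but a single record can span multiple lines. I have written some code, but reading each record separately takes too long. Sample code is below.

Please help.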

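import gzip
import pandas as pd

# Empty frame that write_line() appends to, one row per record.
df = pd.DataFrame( {'eventtime':[],
                    'FileType':[],
                    'msg_type':[],
                    'thred_num':[],
                    'msg_lyr':[], 
                    'message':[],
                    'colorcode':[]
                    })
def write_line(record,file_type):
    if record=='':
        return

    # Split "<date> <time> <level> [<thread>] <layer> - [<message>" on "[".
    split1= record.split("[")
    split2 = split1[0].split(" ")   # date, time, level
    split3 = split1[1].split(" ")   # "<thread>]", layer, ...
    split4 = split1[2]              # message text

    # Turn "08:53:26,261" into "08:53:26.261".
    s_time = split2[1].split(",")
    str_dateime  = split2[0] + ' ' + s_time[0] + "."+s_time[1]

    df.loc[len(df)] = pd.Series( {'eventtime':str_dateime,
                    'FileType':file_type,
                    'msg_type':split2[2],
                    'thred_num':split3[0][:-1],
                    'msg_lyr':split3[1], 
                    'message':split4,
                    'colorcode':""
                    })

# si_Files is the list of gzip-compressed log file paths (defined elsewhere).
for si_file in si_Files:
    with gzip.open(si_file,'rt') as f:
        file_content = f.read()
    # Records end with "]~~" followed by a newline.
    for record in file_content.split("]~~\n"):
        write_line(record,'silogs')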

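Is there a better way to do this? Reading the file and then loading the data into the DataFrame row by row uses too many resources, which wastes capacity on the application server that runs the main application.

A sample log file is shown below: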

2019-12-02 08:53:26,261 INFO  [18] CTL.CONF     - [Loading Configurations]~~
2019-12-02 08:53:26,273 DEBUG [18] trg.sm.cs.client.CfgInterface - [Start: |User:default|ClientMachine:xxx.xxx.xxx.xxx]~~
2019-12-02 08:53:26,274 DEBUG [18] trg.sm.cs.client.CfgInterface - [Start: Waiting for connection with configuration server: xxx.xxx.xxx.xxx:X000]~~
2019-12-02 08:53:26,328 INFO  [19] GSI.Comms.SC - [Connecting|xxx.xxx.xx.xxx:x000]~~
2019-12-02 08:53:26,329 WARN  [19] GSI.Comms.SC - [Fast Loopback enabled]~~
2019-12-02 08:53:27,334 ERROR [19] GSI.Comms.SC - [Failed to connect with
server|Endpoint:xxx.xxx.xx.xxx:x000|Error:No connection could be made because the target
machine actively refused it xxx.xxx.xx.xxx:x000(ConnectionRefused:10061)]~~
2019-12-02 08:53:30,340 INFO  [19] GSI.Comms.SC - [Connecting|xxx.xxx.xx.xxx:x000]~~
2019-12-02 08:53:30,341 WARN  [19] GSI.Comms.SC - [Fast Loopback enabled]~~
2019-12-02 08:53:30,393 WARN  [19] sm.cs.client.CfgInterface - [Start: No QueryReload performed, EngineId missing.]~~
2019-12-02 08:53:30,393 DEBUG [19] sm.cs.client.CfgInterface - [ClientConnection |  Connected to the Server : Primary]~~
2019-12-02 08:53:30,393 DEBUG [18] sm.cs.client.CfgInterface - [Start: done.]~~
2019-12-02 08:53:30,512 DEBUG [13] CTL.CONF     - [ReloadResponse: Submitted]~~
2019-12-02 08:53:31,791 INFO  [18] CTL.CONF     - [GroupSettingsXml|<Groups>
xml tages EventProcessingLagThreshold 100 /EventProcessingLagThreshold
  XML tages QueueLengthThreshold 100 /QueueLengthThreshold
xml tags/Groups]~~
2019-12-02 08:53:31,803 INFO  [18] CTL.CONF     - [EventProcessingLagThreshold:100]~~
2019-12-02 08:53:32,122 INFO  [18] SM.ENG.SatmapEngineCommonLibrary.Emailingclient.EmailAlerts - [Alerts Initialized with SMTPserver: 10.80.10.141, SMTPport: , Recipients: ssadaasd,  Sender Address: noreply-aasdlabaasd]~~
2019-12-02 08:53:41,856 INFO  [35] PRO.0.Q-AS-R - [QueryStatusResponse|ASC|AID:70032|DN:20029|CS:client_MODE_UNKNOWN=>client_MODE_READY|TS:TALK_STATE_UNKNOWN=>TALK_STATE_AVAILABLE]~~
2019-12-02 08:53:41,863 INFO  [31] CTL          - [HandleclientStatusEvent(client_MODE_READY)|AID:70032|KeepFreeclientOrder:False|ER:SUCCESSFULL]~~
2019-12-02 08:53:41,871 DEBUG [27] GSI.Comms.CM - [Tx|0|clientFree|MsgId:22|D-CID:1]~~
2019-12-02 08:53:41,899 DEBUG [24] TSP.EF       - [Rx|QUERY_TOD_RESPONSE|INV-ID:20|TOD:12/02/2019 11:53:41]~~
2019-12-02 08:53:41,899 DEBUG [35] PRO.0        - [Pr|QUERY_TOD_RESPONSE]~~
2019-12-02 08:53:41,899 DEBUG [35] PRO.0        - [RR|QueryTodRequest|INV-ID:20|08:53:41 808,08:53:41 817,08:53:41 899|RTT:90.98ms|PT:8.97]~~
2019-12-02 08:53:41,899 INFO  [35] PRO.0        - [CR|QueryTodRequest(ToCheckIfSkillMonitorCompleted)|S:80029|Pass:3]~~
2019-12-02 08:53:41,899 DEBUG [24] TSP.EF       - [Rx|QUERY_TOD_RESPONSE|INV-ID:21|TOD:12/02/2019 11:53:41]~~
2019-12-02 08:53:41,899 DEBUG [35] PRO.0        - [Pr|QUERY_TOD_RESPONSE]~~
2019-12-02 08:53:41,899 DEBUG [35] PRO.0        - [RR|QueryTodRequest|INV-ID:21|08:53:41 814,08:53:41 820,08:53:41 899|RTT:85.01ms|PT:6]~~
2019-12-02 08:53:41,899 INFO  [36] TSP.0.1      - [Tx|QueryTime]~~
2019-12-02 08:53:41,900 INFO  [36] PRO.0.RM     - [RP|QueryTodRequest|INV-ID:22|ICheckRequest:False|SkillId:80029(Pass:3)]~~
2019-12-02 08:53:41,903 INFO  [36] PRO.0        - [ST|OR Count:0|PR Count:1]~~
2019-12-02 08:53:41,950 DEBUG [24] TSP.EF       - [Rx|QUERY_TOD_RESPONSE|INV-ID:22|TOD:12/02/2019 11:53:41]~~
2019-12-02 08:53:41,950 DEBUG [35] PRO.0        - [Pr|QUERY_TOD_RESPONSE]~~
2019-12-02 08:53:41,950 DEBUG [35] PRO.0        - [RR|QueryTodRequest|INV-ID:22|08:53:41 899,08:53:41 899,08:53:41 950|RTT:51ms|PT:]~~
2019-12-02 08:53:41,950 INFO  [35] PRO.0.Q-STOD - [SkillMonitored, Marked for Route Register|SK:80029]~~
2019-12-02 08:53:41,952 INFO  [35] PRO.0.SMGR.1 - [RegisterSkill|SK:80029]~~
python pandas delimiter line-breaks
1 Answer
Define the following generator:

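def myReader(file_name):
    outRow = ''
    with open(file_name, 'r') as f:
        for row in f:
            row = row.strip()
            if row[-2:] == '~~':
                # End of a record: add the text without the '~~' terminator and emit it.
                outRow += row[0:-2]
                yield outRow
                outRow = ''
            else:
                # Continuation line: keep accumulating.
                outRow += row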

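It appends each "continuation" line (one that does not end with "~~") to the text collected so far and yields the complete record once a line ending with "~~" is reached.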
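Since the files in the question are gzip-compressed, a variant of the same idea that accepts any iterable of lines may be more convenient. The following is only a sketch: the names `iter_records` and `iter_gzip_records` are hypothetical, and joining the pieces of a record with a single space is an assumption (the generator above concatenates them with no separator).

```python
import gzip


def iter_records(lines, terminator="~~"):
    """Yield complete records from an iterable of text lines.

    A record ends at the first line whose stripped text ends with `terminator`.
    """
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith(terminator):
            parts.append(line[:-len(terminator)])
            yield " ".join(parts)
            parts = []
        else:
            parts.append(line)  # continuation line
    # Any trailing text without a terminator is treated as incomplete and dropped.


def iter_gzip_records(path):
    """Hypothetical helper: stream records from a gzip-compressed log file."""
    with gzip.open(path, "rt") as fh:
        yield from iter_records(fh)


# Small in-memory check using lines shaped like the sample log.
sample = [
    "2019-12-02 08:53:30,512 DEBUG [13] CTL.CONF     - [ReloadResponse: Submitted]~~\n",
    "2019-12-02 08:53:31,791 INFO  [18] CTL.CONF     - [GroupSettingsXml|<Groups>\n",
    "xml tags/Groups]~~\n",
]
records = list(iter_records(sample))
```

Because the file is read lazily, only one record is held in memory at a time, unlike `f.read()` followed by `split`.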

Then, assuming you have only one input file, load your DataFrame by running:

import re
import pandas as pd

df = pd.DataFrame(columns=['eventtime', 'FileType', 'msg_type',
                           'thread_num', 'msg_lyr', 'message', 'colorcode'])
# msg_lyr allows '-' as well, so that names such as PRO.0.Q-AS-R match.
pat = re.compile(r'(?P<Date>[\d-]+ [\d:]+,\d+) (?P<msg_type>\w+) +'
                 r'\[(?P<thread_num>\d+)\] (?P<msg_lyr>[\w.-]+) +\- \[(?P<message>[^\]]+)')
myGen = myReader('log.txt')
for row in myGen:
    mtch = pat.match(row)
    if mtch:
        # Convert "08:53:26,261" to "08:53:26.261".
        dat = mtch.group('Date').replace(',', '.')
        df.loc[len(df)] = pd.Series({'eventtime': dat,
                                     'FileType': 'silogs',
                                     'msg_type': mtch.group('msg_type'),
                                     'thread_num': mtch.group('thread_num'),
                                     'msg_lyr': mtch.group('msg_lyr'),
                                     'message': mtch.group('message'),
                                     'colorcode': ''})
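Appending with `df.loc[len(df)] = ...` grows the frame one row at a time, which tends to be slow for large logs. A commonly used alternative is to collect plain dictionaries in a list and build the DataFrame once at the end. The sketch below assumes each record has already been joined into one string with the trailing "~~" removed; the names `LOG_PATTERN` and `records_to_frame` are hypothetical, and the pattern is a slightly loosened variant (any non-space layer name, message taken up to the final `]`), not the exact one from the code above.

```python
import re

import pandas as pd

# Assumed record layout: "<date> <time>,<ms> <level> [<thread>] <layer> - [<message>]"
LOG_PATTERN = re.compile(
    r"(?P<eventtime>[\d-]+ [\d:]+,\d+) +(?P<msg_type>\w+) +"
    r"\[(?P<thread_num>\d+)\] +(?P<msg_lyr>\S+) +- +\[(?P<message>.*)\]$",
    re.DOTALL,  # let the message span embedded newlines
)

COLUMNS = ["eventtime", "FileType", "msg_type", "thread_num",
           "msg_lyr", "message", "colorcode"]


def records_to_frame(records, file_type="silogs"):
    """Parse complete records and build the DataFrame in a single step."""
    rows = []
    for record in records:
        match = LOG_PATTERN.match(record)
        if match is None:
            continue  # skip anything that does not look like a log record
        row = match.groupdict()
        row["eventtime"] = row["eventtime"].replace(",", ".")
        row["FileType"] = file_type
        row["colorcode"] = ""
        rows.append(row)
    return pd.DataFrame(rows, columns=COLUMNS)


records = [
    "2019-12-02 08:53:26,261 INFO  [18] CTL.CONF     - [Loading Configurations]",
    "2019-12-02 08:53:41,856 INFO  [35] PRO.0.Q-AS-R - [QueryStatusResponse|ASC|AID:70032]",
    "not a log record",
]
df = records_to_frame(records)
```

With several input files, the row lists (or the per-file frames, via `pd.concat`) can be combined before the final construction, so the expensive step still runs only once.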
