How do I parse a plain-text table? (multi-line)

Question  Votes: 2  Answers: 1

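I want to parse a table that is easy to read visually but lacks any real machine-readable pattern. I would like the result as a dictionary in Python, although I will eventually convert it to a DataFrame. Reading left to right there are really 6 columns: course 1 code, course 1 title, course 1 units, course 2 code, course 2 title, course 2 units. Splitting it that finely may be harder to parse, though. Any help is appreciated.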

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">

<!-- $Id: bymaj_report.htm,v 1.9 2008/01/29 17:54:48 adt Exp $ -->
<html>
<head>
<meta content="no-cache" http-equiv="Pragma"/>
<meta content="-1" http-equiv="Expires"/>
<meta content="NO-CACHE" http-equiv="CACHE-CONTROL"/>
<title> ASSIST: By Major Report </title>
</head>
<body style="margin: 0; padding: 0; background: #FFFEEA; font-family: Arial, Helvetica, sans-serif; font-size: 12px;">
<p align="LEFT">
<pre>
                        Articulation Agreement by Major                         
                    Effective during the 16-17 Academic Year                    
     
<b>    ====Electrical Engineering &amp; Computer Sciences, Lower Division B.S.====     </b>
<b>                                                                             
COLLEGE OF ENGINEERING JUNIOR TRANSFER ADMISSION REQUIREMENTS:</b>              
                                                                                
Admission to the UC Berkeley <b>College of Engineering </b>is highly competitive.
                                                                                
Applicants to the <b>Electrical Engineering and Computer Science </b>major must
complete all <u>required</u> core UCB preparation courses in order to be eligible for
admission. Only applicants who have completed 100% of these <u>required</u> courses
will be considered for admission.  Required courses for admission to the major
must be completed by the end of the spring semester prior to fall enrollment.
<b>A summer 2017 course is not considered to be "work in progress" for the fall
2017 selection process.
                                                                                
</b>If a series of courses at a community college is required (e.g., English 1A + 1B
+ 103 = English R1A and R1B), <u>all</u> the courses in the series <u>must</u> be completed,
and <u>must</u> (unless otherwise indicated) be completed at the same community
college. Partial completion (e.g., 2 of the 3 required courses) will result in
zero credit toward the requirement(s), and the applicant will NOT be considered
for admission.
<b>                                                                             
</b>Lower division UC Berkeley courses required for graduation (but not admission)
are also listed in the major agreements and are strongly recommended to be taken
to strengthen one's application.  The more of these courses completed, the
stronger the application will be.
                                                                                
Required core courses for admission: (all these courses must be completed to be
considered for admission)
                                                                                
- UCB Math 1A, 1B                                                               
- UCB Math 53, 54                                                               
- UCB Physics 7A, 7B                                                            
- UCB English R1A and R1B                                                       
- One from UCB Astronomy 7A or 7B or Bio 1A &amp; 1AL or Bio 1B or Chem 1A/1AL or   
  Chem 1B or Chem 3A/L or 3B/L or Mcellbi 32 &amp; 32L or Physics 7C                
                                                                                
Strongly recommended courses: (if your college offers the courses listed below
and they are articulated, taking them will strengthen your application)
                                                                                
Electrical Engineering 20 and 40 were taught for the final time at UCB in Fall
2015.  Electrical Engineering 20 and Electrical Engineering 40 have been
replaced with Electrical Engineering 16A and 16B.  <b>The curriculum changes are
effective for students admitted beginning Fall 15. </b>
                                                                                
- UCB Compsci 61A                                                               
- UCB Compsci 61B                                                               
- UCB Compsci 61C                                                               
- UCB El Eng 16A                                                                
- UCB El Eng 16B                                                                
- UCB Compsci 70                                                                
                                                                                
Admission is primarily based on the completeness of the applicant's lower
division preparation and the level of academic achievement reflected in the
student's grade point average.  The UC applicant essay also plays an important
role in the selection process at UC Berkeley.  The College reviews the essay for
evidence of interest in the student's chosen field and a thoughtful match
between the academic program and the student's academic and career objectives.
                                                                                
The College of Engineering requires six humanities/social science courses, two
of which must be reading and composition.  The only non-technical admission
requirement for the College of Engineering is the coursework equivalent to UC
Berkeley's English R1A and R1B (reading and composition), which must be taken
for a letter grade. The College of Engineering <b>does not recognize the
Intersegmental General Education Transfer Curriculum (IGETC) and strongly
discourages</b> students from following this option due to the number of
major-specific technical courses required for engineering transfer admission.
<b>NOTE:</b> The English R1A and R1B requirements <u>cannot</u> be satisfied by IGETC;
applicants <u>must</u> complete the specific courses indicated as English R1A and R1B
equivalents to be considered for admission. Failure to complete the exact
courses listed will mean the applicant will NOT be considered for admission.
                                                                                
The remaining four humanities/social science requirement courses are not
considered for admission purposes but are required for graduation.  See
<a href="http://coe.berkeley.edu/hssreq" target="_blank">http://engineering.berkeley.edu/hssreq</a> for the College of Engineering
humanities/social science breadth requirements and courses.  Courses which are
three semester units or more that appear in the following categories on the
"General Education/Breadth" section of <a href="http://assist.org" target="_blank">assist.org</a> may be used to satisfy
<b>two of</b> the remaining four humanities/social science course requirements for the
College of Engineering.  ARTS AND LITERATURE; HISTORICAL STUDIES; INTERNATIONAL
STUDIES; PHILOSOPHY AND VALUES; SOCIAL AND BEHAVIORAL SCIENCES.
                                                                                
SAT/ACT/A-level test scores and letters of recommendation are NOT considered for
admission.
                                                                                
<b>NOTE: ALL REQUIRED COURSES AND ALL STRONGLY RECOMMENDED COURSES FOR THE MAJOR
MUST BE TAKEN FOR A LETTER GRADE.  FOR MORE INFORMATION, PLEASE CHECK THE
COLLEGE'S WEB SITE FOR THE <u>COLLEGE OF ENGINEERING UNDERGRADUATE GUIDE.</u>
                                                                                
For more information:                                                           
</b><a href="http://engineering.berkeley.edu/admissions/undergraduate-admissions" target="_blank">http://engineering.berkeley.edu/admissions/undergraduate-admissions</a> <b>
                                                                                
College of Engineering Undergraduate Guide:</b>                                 
<a href="http://engineering.berkeley.edu/academics/undergraduate-guide" target="_blank">http://engineering.berkeley.edu/academics/undergraduate-guide</a><b><a href="http://coe.berkeley.edu/guide " target="_blank">
</a>                                                                            
For more information on Electrical Engineering &amp; Computer Science:</b>          
<a href="http://www.eecs.berkeley.edu" target="_blank">http://www.eecs.berkeley.edu</a>
                                                                                
<b>For more information on admission to UC Berkeley:</b>                        
<a href="http://admissions.berkeley.edu" target="_blank">http://admissions.berkeley.edu</a>
                                                                                
<b>For more information on majors at UC Berkeley:</b>                           
<b>Berkeley Academic Guide: </b><a href="http://guide.berkeley.edu/" target="_blank">http://guide.berkeley.edu</a>
                                                                                
--------------------------------------------------------------------------------
                                 <b> AP TEST CREDIT</b>                         
                                                                                
For students who have taken Advanced Placement Exams in high school, the College
will clear requirements as follows:
                                                                                
Biology AP: a score of 4 or 5 satisfies UCB Biology 1A/AL and 1B.               
Chemistry AP: a score of 3 or better satisfies UCB Chemistry 1A/1AL.            
English AP (Literature and Composition): a score of 4 or 5 satisfies UCB English
R1A.
English AP (Language and Composition): a score of 4 or 5 satisfies UCB English
R1A.
Mathematics AP (AB Exam): a score of 3 or better satisfies UCB Math 1A.         
Mathematics AP (BC Exam): a score of 3 satisfies UCB Math 1A.                   
Mathematics AP (BC Exam): a score of 4 or 5 satisfies UCB Math 1A and 1B.       
Physics AP (Mechanics C Exam): a score of 5 satisfies UCB Physics 7A.           
--------------------------------------------------------------------------------
                          <b>Required Courses for Admission:</b>                
--------------------------------------------------------------------------------
MATH 1A    Calculus                   (4)|MATH 150    Calculus and Analytic  (5)
                                         |            Geometry I 
--------------------------------------------------------------------------------
MATH 1B    Calculus                   (4)|MATH 155    Calculus and Analytic  (4)
                                         |            Geometry II 
--------------------------------------------------------------------------------
MATH 53    Multivariable Calculus     (4)|MATH 260    Calculus and Analytic  (4)
                                         |            Geometry III 
--------------------------------------------------------------------------------
MATH 54    Linear Algebra and         (4)|MATH 265 <b><u>&amp;</u></b></pre></p></body></html>  Differential Equations (4)
           Differential Equations        |MATH 270    Linear Algebra         (4)
--------------------------------------------------------------------------------
PHYSICS 7A    Physics for Scientists  (4)|PHYS 151    Principles of Physics  (4)
              and Engineers              |            I 
--------------------------------------------------------------------------------
PHYSICS 7B    Physics for Scientists  (4)|PHYS 152    Principles of Physics  (4)
              and Engineers              |            II 
--------------------------------------------------------------------------------
ENGLISH R1A    Reading and            (4)|ENGL 100    Composition and        (4)
               Composition               |            Reading 
--------------------------------------------------------------------------------
ENGLISH R1B    Reading and            (4)|ENGL 201     Critical Thinking,    (4)
               Composition               |             Composition, and 
                                         |             Literature 
                                         |    <b><u>OR</u></b> 
                                         | 
                                         |ENGL 201H    Critical Thinking,    (4)
                                         |             Composition, and 
                                         |             Literature (Honors) 
                                         |    <b><u>OR</u></b> 
                                         |ENGL 202     Critical Thinking and (4)
                                         |             Composition 
                                         |    <b><u>OR</u></b> 
                                         |ENGL 202H    Critical and Thinking (4)
                                         |             and Composition 
                                         |             (Honors) 
--------------------------------------------------------------------------------
<b>                   Natural Science required for admission:                   
             </b>One course or course series required from the list below:      
--------------------------------------------------------------------------------
ASTRON 7A    Introduction to          (4)|NO COURSE ARTICULATED 
             Astrophysics                |                                      
--------------------------------------------------------------------------------
ASTRON 7B    Introduction to          (4)|NO COURSE ARTICULATED 
             Astrophysics                |                                      
--------------------------------------------------------------------------------
BIOLOGY 1A  <b><u>&amp;</u></b>  General Biology        (3)|BIO 202 <b><u>&amp;</u></b>  Foundations of Biology: (4)
               Lecture (Cells,           |           Evolution, 
               Genetics, Animal Form     |           Biodiversity and 
               &amp; Function)               |           Organismal Biology 
BIOLOGY 1AL <b><u>&amp;</u></b>  General Biology        (2)|BIO 204    Foundations of Biology: (4)
               Laboratory                |           Biochemistry, Cell 
BIOLOGY 1B     General Biology (Plant (4)|           Biology, Genetics and 
               Form &amp; Function,          |           Molecular Biology 
               Ecology, Evolution)       |                                      
--------------------------------------------------------------------------------
CHEM 1A  <b><u>&amp;</u></b>  General Chemistry         (3)|CHEM 110  <b><u>&amp;</u></b>  General Chemistry     (5)
CHEM 1AL <b><u>&amp;</u></b>  General Chemistry         (1)|CHEM 111     General Chemistry     (5)
            Laboratory                   |    <b><u>OR</u></b> 
CHEM 1B     General Chemistry         (4)|CHEM 110H <b><u>&amp;</u></b>  General Chemistry I   (5)
                                         |             (Honors) 
                                         |CHEM 111H    General Chemistry II  (5)
                                         |             (Honors) 
--------------------------------------------------------------------------------
CHEM 3A  <b><u>&amp;</u></b>  Chemical Structure and    (3)|CHEM 210     Organic Chemistry I   (5)
            Reactivity                   |    <b><u>OR</u></b> 
CHEM 3AL    Organic Chemistry         (2)|CHEM 210H    Organic Chemistry I   (5)
            Laboratory                   |             (Honors) 
--------------------------------------------------------------------------------
CHEM 3B  <b><u>&amp;</u></b>  Chemical Structure and    (3)|CHEM 211     Organic Chemistry II  (5)
            Reactivity                   |    <b><u>OR</u></b> 
CHEM 3BL    Organic Chemistry         (2)|CHEM 211H    Organic Chemistry II  (5)
            Laboratory                   |             (Honors) 
--------------------------------------------------------------------------------
MCELLBI 32  <b><u>&amp;</u></b>  Introduction to Human  (3)|BIO 220    Human Physiology        (4)
               Physiology                |                                      
MCELLBI 32L    Introduction to Human  (2)|                                      
               Physiology Laboratory     |                                      
--------------------------------------------------------------------------------
PHYSICS 7C    Physics for Scientists  (4)|PHYS 253    Principles of Physics  (4)
              and Engineers              |            III 
--------------------------------------------------------------------------------
<b>Strongly Recommended Courses</b> (if your college offers courses listed below and
they are articulated, taking them will strengthen your application):
                                                                                
Electrical Engineering 20 and 40 were taught for the final time at UCB in Fall
2015.  Electrical Engineering 20 and Electrical Engineering 40 have been
replaced with Electrical Engineering 16A and 16B.  <b>The curriculum changes are
effective for students admitted beginning Fall 15</b>.
                                                                                
If no articulation, students are strongly encouraged to take an introductory
course in electronics or circuits AND courses in Java, C++ and Data Structures.
--------------------------------------------------------------------------------
COMPSCI 61A    The Structure and      (4)|NO COURSE ARTICULATED 
               Interpretation of         |                                      
               Computer Programs         |                                      
--------------------------------------------------------------------------------
COMPSCI 61B    Data Structures        (4)|CS 112 <b><u>&amp;</u></b>  Introduction to Computer (3)
                                         |          Science II: Java 
                                         |CS 113    Basic Data Structures    (3)
                                         |          and Algorithms 
                                         |<b>NOTE:</b>  Students must also complete UCB
                                         |COMPSCI 47B at Berkeley to satisfy 
                                         |this requirement. 
--------------------------------------------------------------------------------
COMPSCI 61C    Machine Structures     (4)|NO COURSE ARTICULATED 
--------------------------------------------------------------------------------
EL ENG 16A    Designing Information   (4)|NO COURSE ARTICULATED 
              Devices and Systems I      |                                      
--------------------------------------------------------------------------------
EL ENG 16B    Designing Information   (4)|NO COURSE ARTICULATED 
              Devices and Systems II     |                                      
--------------------------------------------------------------------------------
COMPSCI 70    Discrete Mathematics    (4)|NO COURSE ARTICULATED 
              and Probability Theory     |                                      
--------------------------------------------------------------------------------
<b>IMPORTANT INFORMATION ABOUT THIS MAJOR:</b>
<b>- </b>The course/s cited have been officially accepted by this major and approved
  by both a Berkeley advisor/faculty member and Berkeley's articulation         
  officer.                                                                      
                                                                                
- Consult ASSIST frequently to obtain current information as this articulation  
  agreement is subject to periodic revision.                                    
--------------------------------------------------------------------------------
<b>END OF MAJOR</b>
<br/>

I want the table at the bottom of the page, the one listing the required courses. I have attached the original URL.

http://web2.assist.org/web-assist/report.do?agreement=aa&reportPath=REPORT_2&reportScript=Rep2.pl&event=19&dir=2&sia=MIRACSTA&ria=UCB&ia=UCB&oia=MIRACSTA&aay=16-17&ay=16-17&dora=EECS

python database parsing web-scraping plaintext
1 Answer
0
votes

You can parse the table with regular expressions. Looking at the file, I made the following observations:

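  1. Courses are separated by a separator line: --------------------------------------------------------------------------------
  2. Between two separators, each line has a left-hand side and a right-hand side, delimited by a pipe |.
  3. When a line introduces a module (course), its right-hand side ends with a number in parentheses (e.g. (1)).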
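Observations 1 and 2 could be sketched roughly as follows. This is only a minimal illustration, not a complete parser: the names SAMPLE, SEPARATOR, split_blocks and split_sides are hypothetical, the sample rows are copied from the report in the question, and treating any line of 20 or more dashes as a separator is an assumption.

```python
import re

# Hypothetical sample in the same layout as the report in the question.
SAMPLE = """\
--------------------------------------------------------------------------------
MATH 1A    Calculus                   (4)|MATH 150    Calculus and Analytic  (5)
                                         |            Geometry I
--------------------------------------------------------------------------------
COMPSCI 61C    Machine Structures     (4)|NO COURSE ARTICULATED
--------------------------------------------------------------------------------
"""

# Assumption: a separator is a line consisting only of 20 or more dashes.
SEPARATOR = re.compile(r"^-{20,}\s*$")


def split_blocks(text):
    """Group the lines found between separator lines into blocks."""
    blocks, current = [], []
    for line in text.splitlines():
        if SEPARATOR.match(line):
            if current:
                blocks.append(current)
            current = []
        else:
            current.append(line)
    if current:
        blocks.append(current)
    return blocks


def split_sides(line):
    """Split one line at the first pipe into (left, right)."""
    left, _, right = line.partition("|")
    return left, right


blocks = split_blocks(SAMPLE)
for block in blocks:
    print([split_sides(line) for line in block])
```

A line without a pipe (for example the prose paragraphs in the report) comes back with an empty right-hand side, which gives a simple way to skip blocks that are not table rows.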

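Part 1 is easy to implement; parts 2 and 3 can be partly solved with regular expressions. I have provided the hardest part of the solution here. You can easily implement the rest and perhaps put the result into a pandas DataFrame.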

import re

# Separate the left-hand and right-hand sides.
strline = "course name |MATH 123 divide by 0 (1)"
left, right = re.split(r'\|', strline)  # split around the pipe

# Look for the parenthesised number on the right-hand side.
m = re.search(r'\(\d+\)', right)
has_module_name = m is not None

if has_module_name:
    # Capture the module code and the description, excluding the number in parentheses.
    m_mod = re.match(r'([a-zA-Z0-9]+ [a-zA-Z0-9]+)([^\(\)]+).*', right)

    if m_mod is not None:
        mod_code = m_mod.group(1)
        mod_desc = m_mod.group(2)
        print("module name: {}".format(mod_code))
        print("module first line desc: {}".format(mod_desc))

# TODO: loop over the following lines to collect the rest of the description,
# and collapse repeated whitespace.

Output:

module name: MATH 123
module first line desc:  divide by 0
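
The remaining TODO (collecting continuation lines and collapsing whitespace) might look something like the sketch below. It is not part of the original answer: parse_side, TAG and ROW are hypothetical names, the input rows are adapted from the ENGL 201 / ENGL 202 entries in the question, and it assumes a course code is separated from its title by at least two spaces and that a course row ends with the units in parentheses.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")  # crude tag stripper, enough for <b>, <u>, <a ...>
ROW = re.compile(
    r"^(?P<code>\S+(?: \S+)*?)"   # course code, e.g. "ENGL 201"
    r"(?:\s+&)?"                  # optional "&" marker linking two courses
    r"\s{2,}(?P<title>.*?)"       # first line of the title
    r"\s*\((?P<units>\d+)\)\s*$"  # units in parentheses at the end
)


def parse_side(lines):
    """Turn the lines of ONE side of one block into a list of course dicts."""
    courses = []
    for raw in lines:
        text = html.unescape(TAG.sub("", raw)).rstrip()
        if not text.strip() or text.strip() == "OR":
            continue  # skip blank lines and the "OR" markers
        m = ROW.match(text)
        if m:
            courses.append({
                "code": m.group("code"),
                "title": m.group("title"),
                "units": int(m.group("units")),
            })
        elif courses:
            # Continuation line: append it to the previous course title.
            courses[-1]["title"] += " " + text.strip()
    for course in courses:
        # Collapse repeated whitespace in the assembled title.
        course["title"] = " ".join(course["title"].split())
    return courses


# Right-hand side rows adapted from the report in the question.
RIGHT_SIDE = [
    "ENGL 201     Critical Thinking,    (4)",
    "             Composition, and ",
    "             Literature ",
    "    <b><u>OR</u></b> ",
    "ENGL 202     Critical Thinking and (4)",
    "             Composition ",
]

courses = parse_side(RIGHT_SIDE)
for course in courses:
    print(course)
```

The resulting list of dictionaries can be passed directly to pandas.DataFrame(courses). Note that this sketch treats every line that is not a course row as a continuation, so free-text lines such as the NOTE under CS 113 would be appended to the preceding title and need separate handling.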