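How can I dynamically populate a database from an HTML table (for example, S&P 500 market data)?
I have a Yahoo! Finance account. In the account I can view financial data in HTML format.
I need a simple tool to populate a database (Access) from an HTML table. Where can I find such a tool?
You can export the historical data from Yahoo as CSV and link that CSV file directly in Access as an MS Access table. http://office.microsoft.com/en-ca/access-help/import-or-link-to-data-in-a-text-file-HA001232227.aspx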
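The CSV link can also be created from VBA rather than through the import wizard. The following is a minimal sketch, not a definitive implementation: the file path and the linked table name are hypothetical placeholders, and it assumes the exported CSV has a header row and uses the default comma delimiter.

```vba
' Sketch only: the path and table name below are hypothetical.
Public Sub LinkYahooCsv()
    Const CsvPath As String = "C:\SomeFolder\sp500.csv"   ' assumed location of the exported CSV
    Const LinkedName As String = "YahooHistory"           ' assumed name for the linked table

    ' acLinkDelim links a delimited text file; True = first row holds field names.
    DoCmd.TransferText acLinkDelim, , LinkedName, CsvPath, True
End Sub
```

Because the table is linked rather than imported, overwriting the CSV with a fresh export should be reflected the next time the table is opened.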
If you want to work with the HTML page source instead, this link may help.
http://www.access-programmers.co.uk/forums/showthread.php?p=1145646
ACE/Jet OLEDB can be used to import data directly from an HTML file. For example, given an existing Access table [DataFromHtml]
ID LastName
-- --------
and an HTML file containing a table
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
<title>
Test Data
</title>
</head>
<body>
<table>
<tr>
<th>
ID
</th>
<th>
LastName
</th>
</tr>
<tr>
<td>
1
</td>
<td>
Thompson
</td>
</tr>
<tr>
<td>
2
</td>
<td>
O'Rourke
</td>
</tr>
</table>
</body>
</html>
the following VBA code clears the Access table (DELETE FROM) and then imports the HTML table data into it.
Sub ImportFromHtml()
    Const LocalTableName = "DataFromHtml"
    Dim con As Object, rstHtml As Object, fld As Object, _
        cdb As DAO.Database, rstAccdb As DAO.Recordset, _
        recCount As Long
    ' Open the HTML file through the ACE "HTML Import" driver
    Set con = CreateObject("ADODB.Connection")
    con.Open _
        "Provider=Microsoft.ACE.OLEDB.12.0;" & _
        "Data Source=C:\Users\Gord\Documents\table.htm;" & _
        "Extended Properties=""HTML Import;HDR=YES;IMEX=1"";"
    ' The table is addressed by the page <title>, here "Test Data"
    Set rstHtml = CreateObject("ADODB.Recordset")
    rstHtml.Open "SELECT * FROM [Test Data]", con
    Set cdb = CurrentDb
    ' Empty the destination table before importing
    cdb.Execute "DELETE FROM [" & LocalTableName & "]", dbFailOnError
    Set rstAccdb = cdb.OpenRecordset(LocalTableName, dbOpenTable)
    recCount = 0
    ' Copy each HTML row, matching fields by (trimmed) column name
    Do While Not rstHtml.EOF
        recCount = recCount + 1
        rstAccdb.AddNew
        For Each fld In rstHtml.Fields
            rstAccdb.Fields(Trim(fld.Name)).Value = Trim(fld.Value)
        Next
        Set fld = Nothing
        rstAccdb.Update
        rstHtml.MoveNext
    Loop
    rstAccdb.Close
    Set rstAccdb = Nothing
    Set cdb = Nothing
    rstHtml.Close
    Set rstHtml = Nothing
    con.Close
    Set con = Nothing
    Debug.Print recCount & " record(s) imported"
End Sub
Assuming the HTML structure from Gord Thompson's solution, there is a very quick way to do this using ADO.
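Public Function GetTitle(ByVal HtmlFile As String) As String
    ' Reads the page <title>, which is used as the source table name below.
    ' MSXML is an XML parser, so this likely only works for well-formed markup.
    Dim DOM As Object
    Set DOM = CreateObject("MSXML2.DOMDocument")
    DOM.async = False ' load synchronously so the document is ready on return
    DOM.Load HtmlFile
    GetTitle = DOM.getElementsByTagName("title")(0).Text
End Function
Public Sub Import(ByVal Filename As String, ByVal Tablename As String)
    Dim SQL As String
    Dim Title As String
    On Error GoTo Import_Error
    Title = GetTitle(Filename)
    ' Drop the old table; ignore the error raised when it does not exist yet
    On Error Resume Next
    CurrentProject.Connection.Execute "DROP TABLE " & Tablename
    On Error GoTo Import_Error
    ' SELECT ... INTO recreates the table straight from the HTML file
    SQL = "SELECT * INTO " & Tablename & _
        " FROM [HTML Import;HDR=YES;IMEX=1;DATABASE=" & Filename & "].[" & Title & "]"
    CurrentProject.Connection.Execute SQL
    Exit Sub
Import_Error:
    ' Report the failure instead of swallowing it silently
    Debug.Print "Import failed: " & Err.Description
End Sub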
So, to get the HTML file "C:\SomeFolder\MyFile.html" into the table "MyImport", use:
Import "C:\SomeFolder\MyFile.html", "MyImport"
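One additional tip: if the HTML file's title contains special characters such as . or :, the import will fail. You will have to experiment to find out which special characters cause problems and which do not.
I know this is an old question, but hopefully my solution will help someone.
I recently received a large number of tables in separate HTML files. The files were exported from an Oracle/Unix system, from a version older than the one that introduced CSV export.
I have MS Access 365 on Win10 with an SSD.
I first tried the solution above, but it seemed to be as slow as importing each HTML file manually.
So I tried a different approach. I added a reference to the "Microsoft HTML Object Library".
The only caveat is that it is not dynamic: I did not add code to map each column, although a table could be created on the fly.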
' Import a table from a local HTML file into a similar Access table.
' Needs references to "Microsoft HTML Object Library" and
' "Microsoft Scripting Runtime" (for FileSystemObject and TextStream).
Function Import(FileName As String, TableName As String) As Long ' returns rows read, header included
    Dim FSO As New FileSystemObject ' faster, and reads \n line endings properly
    Dim ts As TextStream
    Dim fld As Long, rows As Long, bt As Long, tm As Single
    Dim rs As DAO.Recordset
    Dim doc As New MSHTML.HTMLDocument
    Dim trTag As Object
    Dim tdTag As Object
    On Error GoTo errH
    Debug.Print "Loading " & FileName
    If Dir(FileName) <> "" Then
        tm = Timer
        bt = FileLen(FileName)
        Set ts = FSO.OpenTextFile(FileName, ForReading)
        doc.body.innerHTML = ts.ReadAll
        Do Until doc.ReadyState = "complete" ' 4
            DoEvents
        Loop
        ts.Close
        Debug.Print "Loaded " & bt & " bytes in " & (Timer - tm) & " seconds. "
    Else
        MsgBox "File does not exist"
        Exit Function
    End If
    CurrentDb.Execute "DELETE FROM [" & TableName & "]", dbFailOnError
    Set rs = CurrentDb.OpenRecordset(TableName, dbOpenTable)
    Debug.Print "Destination has " & rs.Fields.Count & " fields (starting at 0)."
    For Each trTag In doc.getElementsByTagName("tr")
        If rows > 0 Then
            rs.AddNew ' first row contains the header; MS Access does not like some column names
            fld = 0
        End If
        For Each tdTag In trTag.childNodes()
            fld = fld + 1 ' field counter; field 0 of the destination is skipped
            If rows = 0 Then
                Debug.Print tdTag.innerText; "|"; ' field names from source
            Else
                ' add field
                rs.Fields(fld) = Trim(tdTag.innerText) ' on Null, errH resumes at the next line (Nz does not work here)
            End If
        Next
        ' add row
        If rows > 0 Then
            rs.Update
        Else
            Debug.Print
            Debug.Print "Source has " & fld & " fields."
            ' you could create a table here on the fly
            ' with field(0) being an autonumber key field, all others dbText(255)
            ' then you re-open the Recordset
        End If
        rows = rows + 1: fld = 0
        If rows Mod 1000 = 0 Then
            DoEvents ' gives the user a chance to interrupt
        End If
    Next
    Import = rows
    rs.Close
    Set rs = Nothing
    Debug.Print "File imported in " & (Timer - tm) & " seconds. "
    Exit Function
errH:
    ' Skip the failing statement (for example a Null cell) and carry on
    Resume Next
End Function
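The comment inside the function mentions creating the destination table on the fly, with field 0 as an AutoNumber key and every other field as text(255). A minimal DAO sketch of that idea follows. It is not part of the original answer: the procedure name CreateImportTable, the key name ID and the generic column names F1, F2, ... are all assumptions made for illustration, and it assumes the table does not already exist.

```vba
' Sketch only: creates TableName with an AutoNumber key plus FieldCount text columns.
' CreateImportTable, "ID" and the "F1", "F2", ... column names are hypothetical.
Public Sub CreateImportTable(ByVal TableName As String, ByVal FieldCount As Long)
    Dim db As DAO.Database
    Dim tdf As DAO.TableDef
    Dim fld As DAO.Field
    Dim idx As DAO.Index
    Dim i As Long

    Set db = CurrentDb
    Set tdf = db.CreateTableDef(TableName)

    ' Field 0: AutoNumber key, so source columns map to Fields(1), Fields(2), ...
    Set fld = tdf.CreateField("ID", dbLong)
    fld.Attributes = fld.Attributes Or dbAutoIncrField
    tdf.Fields.Append fld

    ' One Text(255) column per source column.
    For i = 1 To FieldCount
        tdf.Fields.Append tdf.CreateField("F" & i, dbText, 255)
    Next i

    ' Make ID the primary key.
    Set idx = tdf.CreateIndex("PrimaryKey")
    idx.Fields.Append idx.CreateField("ID")
    idx.Primary = True
    tdf.Indexes.Append idx

    db.TableDefs.Append tdf
End Sub
```

It could be called once the header row has been counted, for example CreateImportTable "MyImport", 22, before the recordset is opened. Cells longer than 255 characters would not fit; dbMemo might be a better type for those columns.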
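The data file I used for this test is 31 MB, with more than 46K rows and 22 fields. I also have other files of more than 200 MB. This function loaded the file in 52 seconds, versus 170 seconds for the OLEDB version. That suggests the MSHTML.HTMLDocument DLL parses the HTML file faster than Microsoft.ACE.OLEDB.12.0.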