I have an SGML file that mixes tags that require a closing tag with tags that don't. BeautifulSoup can prettify this kind of markup for HTML, but my tags are custom, and BeautifulSoup only closes them at the end of the file. Here's my code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.sec.gov/Archives/edgar/data/1122304/000119312515118890/0001193125-15-118890.hdr.sgml'
sgml = requests.get(url).text
soup = BeautifulSoup(sgml, 'html5lib')
And here's an excerpt of the file:
<SEC-HEADER>0001193125-15-118890.hdr.sgml : 20150403
<ACCEPTANCE-DATETIME>20150403143902
<ACCESSION-NUMBER>0001193125-15-118890
<TYPE>DEF 14A
<PUBLIC-DOCUMENT-COUNT>37
<PERIOD>20150515
<FILING-DATE>20150403
<DATE-OF-FILING-DATE-CHANGE>20150403
<EFFECTIVENESS-DATE>20150403
<FILER>
<COMPANY-DATA>
<CONFORMED-NAME>AETNA INC /PA/
<CIK>0001122304
<ASSIGNED-SIC>6324
<IRS-NUMBER>232229683
<STATE-OF-INCORPORATION>PA
<FISCAL-YEAR-END>1231
</COMPANY-DATA>
...
</SEC-HEADER>
Here, container tags such as FILER and COMPANY-DATA require a closing tag, while the others don't: their value simply ends at the end of the line.
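To make that rule concrete, here is a minimal sketch of a pre-processing pass that produces the structure I'm after. It uses only the standard `re` module and is a workaround, not something BeautifulSoup does. It assumes that any tag name with an explicit closing tag somewhere in the file is a container, and that every other opening tag ends at the end of its line. `close_line_tags` is a hypothetical helper name of my own.

```python
import re

# Matches an opening tag at the start of a line, e.g. "<CIK>0001122304".
OPEN_TAG = re.compile(r'\s*<([A-Za-z0-9-]+)>')
# Matches any explicit closing tag, e.g. "</COMPANY-DATA>".
CLOSE_TAG = re.compile(r'</([A-Za-z0-9-]+)>')


def close_line_tags(sgml: str) -> str:
    """Append a closing tag to every line whose opening tag is never
    explicitly closed anywhere in the document (assumption: such tags
    hold a single-line value)."""
    # Tags that are closed explicitly somewhere are treated as containers.
    containers = set(CLOSE_TAG.findall(sgml))
    fixed = []
    for line in sgml.splitlines():
        match = OPEN_TAG.match(line)
        if match and match.group(1) not in containers:
            line = f'{line.rstrip()}</{match.group(1)}>'
        fixed.append(line)
    return '\n'.join(fixed)
```

With this, `<CIK>0001122304` becomes `<CIK>0001122304</CIK>`, while `<SEC-HEADER>`, `<FILER>` and `<COMPANY-DATA>` are left alone because their closing tags appear later in the file. The result could then be passed to `BeautifulSoup(..., 'html.parser')`, but I'd prefer a way to get this behaviour from the parser itself.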
How can I tell BeautifulSoup's parser to close certain tags at the end of the line? Does it have something to do with how BeautifulSoup treats br and li differently from a and div?