Class p.b.B.BeautifulSoup(BeautifulStoneSoup):

Part of pymine.beautifulsoup.BeautifulSoup

Known subclasses: pymine.beautifulsoup.BeautifulSoup.ICantBelieveItsBeautifulSoup, pymine.beautifulsoup.BeautifulSoup.MinimalSoup, pymine.beautifulsoup.BeautifulSoup.RobustHTMLParser

This parser knows the following facts about HTML:

* Some tags have no closing tag and should be interpreted as being
  closed as soon as they are encountered.

* The text inside some tags (e.g. 'script') may contain tags which
  are not really part of the document and which should be parsed
  as text, not tags. If you want to parse the text as tags, you can
  always fetch it and parse it explicitly.

* Tag nesting rules:

  Most tags can't be nested at all. For instance, the occurrence of
  a <p> tag should implicitly close the previous <p> tag.

   <p>Para1<p>Para2
    should be transformed into:
   <p>Para1</p><p>Para2

  Some tags can be nested arbitrarily. For instance, the occurrence
  of a <blockquote> tag should _not_ implicitly close the previous
  <blockquote> tag.

   Alice said: <blockquote>Bob said: <blockquote>Blah
    should NOT be transformed into:
   Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

  Some tags can be nested, but the nesting is reset by the
  interposition of other tags. For instance, a <tr> tag should
  implicitly close the previous <tr> tag within the same <table>,
  but not close a <tr> tag in another table.

   <table><tr>Blah<tr>Blah
    should be transformed into:
   <table><tr>Blah</tr><tr>Blah
    but,
   <tr>Blah<table><tr>Blah
    should NOT be transformed into:
   <tr>Blah<table></tr><tr>Blah
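
The three nesting rules above can be sketched with the standard library
alone. This is an illustration, not the code BeautifulSoup itself runs:
the class NestingSketch, the function normalize, and the three small
rule tables are hypothetical names invented for this example, and the
tables cover only a handful of tags. Unlike the examples above, the
sketch also closes every tag still open at the end of the input.

```python
from html.parser import HTMLParser

# Hypothetical, heavily reduced rule tables (illustration only).
SELF_CLOSING = {"br", "hr", "img", "input", "meta", "link"}
FREELY_NESTABLE = {"blockquote", "div", "table"}
# A scoped tag closes its previous occurrence unless one of these
# enclosing tags was opened in between.
SCOPED = {"tr": {"table"}, "td": {"tr"}, "li": {"ul", "ol"}}


class NestingSketch(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # names of the currently open tags
        self.out = []    # pieces of the normalized markup

    def _close_through(self, index):
        # Close every open tag from the top of the stack down to `index`.
        while len(self.stack) > index:
            self.out.append("</%s>" % self.stack.pop())

    def handle_starttag(self, tag, attrs):
        if tag in SELF_CLOSING:
            # Closed as soon as it is encountered; never pushed.
            self.out.append("<%s/>" % tag)
            return
        if tag not in FREELY_NESTABLE:
            barriers = SCOPED.get(tag, FREELY_NESTABLE)
            # Walk down the stack looking for an open tag of the same
            # name, giving up when a barrier tag is met first.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i] == tag:
                    self._close_through(i)
                    break
                if self.stack[i] in barriers:
                    break
        self.stack.append(tag)
        self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            last = len(self.stack) - 1 - self.stack[::-1].index(tag)
            self._close_through(last)

    def handle_data(self, data):
        self.out.append(data)


def normalize(markup):
    parser = NestingSketch()
    parser.feed(markup)
    parser.close()
    parser._close_through(0)  # close whatever is still open
    return "".join(parser.out)


print(normalize("<p>Para1<p>Para2"))
print(normalize("<table><tr>Blah<tr>Blah"))
print(normalize("<tr>Blah<table><tr>Blah"))
```

In this sketch a second <p> closes the first, a second <tr> closes the
first only when no <table> was opened in between, and <blockquote> is
never closed implicitly.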

Differing assumptions about tag nesting rules are a major source
of problems with the BeautifulSoup class. If BeautifulSoup does not
treat as nestable a tag that your page author treats as nestable,
try ICantBelieveItsBeautifulSoup, MinimalSoup, or
BeautifulStoneSoup before writing your own subclass.

Method	__init__	The Soup object is initialized as the 'root tag', and the
Method	start_meta	Beautiful Soup can detect a charset included in a META tag,

Inherited from BeautifulStoneSoup:

Method	convert_charref	This method fixes a bug in Python's SGMLParser.
Method	_feed	Undocumented
Method	__getattr__	This method routes method call requests to either the SGMLParser
Method	isSelfClosingTag	Returns true iff the given string is the name of a
Method	reset	Undocumented
Method	popTag	Undocumented
Method	pushTag	Undocumented
Method	endData	Undocumented
Method	_popToTag	Pops the tag stack up to and including the most recent
Method	_smartPop	We need to pop up to the previous tag of this type, unless
Method	unknown_starttag	Undocumented
Method	unknown_endtag	Undocumented
Method	handle_data	Undocumented
Method	_toStringSubclass	Adds a certain piece of text to the tree as a NavigableString
Method	handle_pi	Handle a processing instruction as a ProcessingInstruction
Method	handle_comment	Handle comments as Comment objects.
Method	handle_charref	Handle character references as data.
Method	handle_entityref	Handle entity references as data, possibly converting known
Method	handle_decl	Handle DOCTYPEs and the like as Declaration objects.
Method	parse_declaration	Treat a bogus SGML declaration as raw data. Treat a CDATA

Inherited from Tag (via BeautifulStoneSoup):

Method	_invert	Cheap function to invert a hash.
Method	_convertEntities	Used in a call to re.sub to replace HTML, XML, and numeric
Method	getString	Undocumented
Method	setString	Replace the contents of the tag with a string
Method	getText	Undocumented
Method	get	Returns the value of the 'key' attribute for the tag, or
Method	clear	Extract all children.
Method	index	Undocumented
Method	has_key	Undocumented
Method	__getitem__	tag[key] returns the value of the 'key' attribute for the tag,
Method	__iter__	Iterating over a tag iterates over its contents.
Method	__len__	The length of a tag is the length of its list of contents.
Method	__contains__	Undocumented
Method	__nonzero__	A tag is non-None even if it has no contents.
Method	__setitem__	Setting tag[key] sets the value of the 'key' attribute for the
Method	__delitem__	Deleting tag[key] deletes all 'key' attributes for the tag.
Method	__call__	Calling a tag like a function is the same as calling its
Method	__eq__	Returns true iff this tag has the same name, the same attributes,
Method	__ne__	Returns true iff this tag is not identical to the other tag,
Method	__repr__	Renders this tag as a string.
Method	__unicode__	Undocumented
Method	_sub_entity	Used with a regular expression to substitute the
Method	__str__	Returns a string or Unicode representation of this tag and
Method	decompose	Recursively destroys the contents of this tree.
Method	prettify	Undocumented
Method	renderContents	Renders the contents of this tag as a string in the given
Method	find	Return only the first child of this Tag matching the given
Method	findAll	Extracts a list of Tag objects that match the given
Method	fetchText	Undocumented
Method	firstText	Undocumented
Method	_getAttrMap	Initializes a map representation of this tag's attributes,
Method	childGenerator	Undocumented
Method	recursiveChildGenerator	Undocumented

Inherited from PageElement (via BeautifulStoneSoup, Tag):

Method	setup	Sets up the initial relations between this element and
Method	replaceWith	Undocumented
Method	replaceWithChildren	Undocumented
Method	extract	Destructively rips this element out of the tree.
Method	_lastRecursiveChild	Finds the last element beneath this object to be parsed.
Method	insert	Undocumented
Method	append	Appends the given tag to the contents of this tag.
Method	findNext	Returns the first item that matches the given criteria and
Method	findAllNext	Returns all items that match the given criteria and appear
Method	findNextSibling	Returns the closest sibling to this Tag that matches the
Method	findNextSiblings	Returns the siblings of this Tag that match the given
Method	findPrevious	Returns the first item that matches the given criteria and
Method	findAllPrevious	Returns all items that match the given criteria and appear
Method	findPreviousSibling	Returns the closest sibling to this Tag that matches the
Method	findPreviousSiblings	Returns the siblings of this Tag that match the given
Method	findParent	Returns the closest parent of this Tag that matches the given
Method	findParents	Returns the parents of this Tag that match the given
Method	_findOne	Undocumented
Method	_findAll	Iterates over a generator looking for things that match.
Method	nextGenerator	Undocumented
Method	nextSiblingGenerator	Undocumented
Method	previousGenerator	Undocumented
Method	previousSiblingGenerator	Undocumented
Method	parentGenerator	Undocumented
Method	substituteEncoding	Undocumented
Method	toEncoding	Encodes an object to a string in some encoding, or to Unicode.

def __init__(self, *args, **kwargs):

overrides pymine.beautifulsoup.BeautifulSoup.BeautifulStoneSoup.__init__

The Soup object is initialized as the 'root tag', and the
provided markup (which can be a string or a file-like object)
is fed into the underlying parser.
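
Accepting "a string or a file-like object" usually comes down to a
check for a read method. A minimal sketch of that idea follows; the
helper name read_markup is hypothetical and is not part of this class.

```python
import io


def read_markup(markup):
    """Return markup text from either a string or a file-like object."""
    # Anything with a read() method is treated as file-like.
    if hasattr(markup, "read"):
        return markup.read()
    return markup


print(read_markup("<p>from a string</p>"))
print(read_markup(io.StringIO("<p>from a file-like object</p>")))
```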

sgmllib will process most bad HTML, and the BeautifulSoup
class has some tricks for dealing with some HTML that kills
sgmllib, but Beautiful Soup can nonetheless choke or lose data
if your data uses self-closing tags or declarations
incorrectly.

By default, Beautiful Soup uses regexes to sanitize input,
avoiding the vast majority of these problems. If the problems
don't apply to you, pass in False for markupMassage, and
you'll get better performance.

The default parser massage techniques fix the two most common
instances of invalid HTML that choke sgmllib:

 <br/> (No space between name of closing tag and tag close)
 <! --Comment--> (Extraneous whitespace in declaration)

You can pass in a custom list of (RE object, replace method)
tuples to get Beautiful Soup to scrub your input the way you
want.
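
The massage step can be pictured as a list of (RE object, replace
method) pairs applied in order. The sketch below uses only the re
module; DEFAULT_MASSAGE, massage, and the two patterns are written for
this example to match the two cases described above, and they may
differ from the patterns the library actually ships.

```python
import re

# Illustrative rules for the two cases described above (assumed patterns).
DEFAULT_MASSAGE = [
    # <br/>  ->  <br />   (insert a space before the closing slash)
    (re.compile(r"(<[^<>]*)/>"), lambda m: m.group(1) + " />"),
    # <! --Comment-->  ->  <!--Comment-->   (drop whitespace after <!)
    (re.compile(r"<!\s+([^<>]*)>"), lambda m: "<!" + m.group(1) + ">"),
]


def massage(markup, rules=DEFAULT_MASSAGE):
    """Apply each (pattern, replace function) pair to the markup in turn."""
    for pattern, replace in rules:
        markup = pattern.sub(replace, markup)
    return markup


# A custom list: the defaults plus a rule that strips NUL bytes.
custom_rules = DEFAULT_MASSAGE + [(re.compile("\x00"), lambda m: "")]

print(massage("<br/>"))
print(massage("<! --Comment-->"))
print(massage("a\x00b<br/>", custom_rules))
```

A list shaped like custom_rules is the kind of value the paragraph
above describes passing in, presumably through the same markupMassage
argument.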

def start_meta(self, attrs):

Beautiful Soup can detect a charset included in a META tag, try to convert the document to that charset, and re-parse the document from the beginning.
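
The detect, convert, and re-parse cycle can be sketched with the
standard library. This is not the body of start_meta: the names
CHARSET_RE, MetaCharsetFinder, and decode_with_meta are hypothetical,
and the sketch assumes the charset is declared in a META tag with
http-equiv="Content-Type".

```python
import codecs
import re
from html.parser import HTMLParser

# Pulls the charset value out of e.g. "text/html; charset=iso-8859-1".
CHARSET_RE = re.compile(r"(?:^|;)\s*charset=([^;]*)", re.IGNORECASE)


class MetaCharsetFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("http-equiv", "").lower() != "content-type":
            return
        match = CHARSET_RE.search(attr_map.get("content", ""))
        if match:
            self.charset = match.group(1).strip()


def decode_with_meta(raw, first_guess="ascii"):
    """Decode bytes, preferring a charset declared in a META tag."""
    # First pass: decode leniently, only to locate the META tag.
    finder = MetaCharsetFinder()
    finder.feed(raw.decode(first_guess, errors="replace"))
    finder.close()
    encoding = first_guess
    if finder.charset:
        try:
            codecs.lookup(finder.charset)  # ignore unknown charsets
            encoding = finder.charset
        except LookupError:
            pass
    # Second pass: decode again from the beginning with the final choice.
    return raw.decode(encoding, errors="replace"), encoding


raw = (b'<meta http-equiv="Content-Type" '
       b'content="text/html; charset=iso-8859-1"><p>caf\xe9</p>')
text, encoding = decode_with_meta(raw)
print(encoding, text)
```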

API Documentation for pymine, generated by pydoctor at 2010-04-07 23:15:24.

Friday, September 21, 2018
