Ahlan Wa Sahlan

Friday, September 21, 2018

API docs for “pymine.beautifulsoup.BeautifulSoup.BeautifulSoup”

Class p.b.B.BeautifulSoup(BeautifulStoneSoup):

Part of pymine.beautifulsoup.BeautifulSoup

This parser knows the following facts about HTML:

* Some tags have no closing tag and should be interpreted as being
  closed as soon as they are encountered.

* The text inside some tags (e.g. 'script') may contain tags which
  are not really part of the document and which should be parsed
  as text, not tags. If you want to parse the text as tags, you can
  always fetch it and parse it explicitly.

* Tag nesting rules:

  Most tags can't be nested at all. For instance, the occurrence of
  a <p> tag should implicitly close the previous <p> tag.

   <p>Para1<p>Para2
    should be transformed into:
   <p>Para1</p><p>Para2

  Some tags can be nested arbitrarily. For instance, the occurrence
  of a <blockquote> tag should _not_ implicitly close the previous
  <blockquote> tag.

   Alice said: <blockquote>Bob said: <blockquote>Blah
    should NOT be transformed into:
   Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

  Some tags can be nested, but the nesting is reset by the
  interposition of other tags. For instance, a <tr> tag should
  implicitly close the previous <tr> tag within the same <table>,
  but not close a <tr> tag in another table.

   <table><tr>Blah<tr>Blah
    should be transformed into:
   <table><tr>Blah</tr><tr>Blah
    but,
   <tr>Blah<table><tr>Blah
    should NOT be transformed into
   <tr>Blah<table></tr><tr>Blah
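
The three nesting rules above can be sketched as a small stack walk. This is only an illustration under stated assumptions: the names NESTABLE, RESET_BY and implicitly_closed are hypothetical, the rule tables are deliberately tiny, and the code uses nothing from the library itself.

```python
# Hypothetical rule tables for illustration only; the real parser keeps
# its own, much larger tables.
NESTABLE = {"blockquote"}        # tags that may nest inside themselves freely
RESET_BY = {"tr": {"table"}}     # tag -> ancestors that reset its nesting


def implicitly_closed(open_tags, new_tag):
    """Return the open tags (innermost first) that opening new_tag closes.

    open_tags lists the currently open tag names, outermost first.
    """
    if new_tag in NESTABLE:
        return []
    resetters = RESET_BY.get(new_tag, set())
    # Walk from the innermost open tag outwards.
    for depth in range(len(open_tags) - 1, -1, -1):
        name = open_tags[depth]
        if name == new_tag:
            # Close the earlier tag of the same type and everything inside it.
            return open_tags[depth:][::-1]
        if name in resetters:
            # A resetting ancestor shields any earlier tag of this type.
            return []
    return []


print(implicitly_closed(["p"], "p"))                    # ['p']
print(implicitly_closed(["blockquote"], "blockquote"))  # []
print(implicitly_closed(["table", "tr"], "tr"))         # ['tr']
print(implicitly_closed(["tr", "table"], "tr"))         # []
```

The four calls mirror the <p>, <blockquote> and two <tr> examples in the list above.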

Differing assumptions about tag nesting rules are a major source
of problems with the BeautifulSoup class. If BeautifulSoup does not
treat as nestable a tag that your page author treats as nestable,
try ICantBelieveItsBeautifulSoup, MinimalSoup, or
BeautifulStoneSoup before writing your own subclass.

Method __init__ The Soup object is initialized as the 'root tag', and the
Method start_meta Beautiful Soup can detect a charset included in a META tag,

Inherited from BeautifulStoneSoup:

Method convert_charref This method fixes a bug in Python's SGMLParser.
Method _feed Undocumented
Method __getattr__ This method routes method call requests to either the SGMLParser
Method isSelfClosingTag Returns true iff the given string is the name of a
Method reset Undocumented
Method popTag Undocumented
Method pushTag Undocumented
Method endData Undocumented
Method _popToTag Pops the tag stack up to and including the most recent
Method _smartPop We need to pop up to the previous tag of this type, unless
Method unknown_starttag Undocumented
Method unknown_endtag Undocumented
Method handle_data Undocumented
Method _toStringSubclass Adds a certain piece of text to the tree as a NavigableString
Method handle_pi Handle a processing instruction as a ProcessingInstruction
Method handle_comment Handle comments as Comment objects.
Method handle_charref Handle character references as data.
Method handle_entityref Handle entity references as data, possibly converting known
Method handle_decl Handle DOCTYPEs and the like as Declaration objects.
Method parse_declaration Treat a bogus SGML declaration as raw data. Treat a CDATA

Inherited from Tag (via BeautifulStoneSoup):

Method _invert Cheap function to invert a hash.
Method _convertEntities Used in a call to re.sub to replace HTML, XML, and numeric
Method getString Undocumented
Method setString Replace the contents of the tag with a string
Method getText Undocumented
Method get Returns the value of the 'key' attribute for the tag, or
Method clear Extract all children.
Method index Undocumented
Method has_key Undocumented
Method __getitem__ tag[key] returns the value of the 'key' attribute for the tag,
Method __iter__ Iterating over a tag iterates over its contents.
Method __len__ The length of a tag is the length of its list of contents.
Method __contains__ Undocumented
Method __nonzero__ A tag is non-None even if it has no contents.
Method __setitem__ Setting tag[key] sets the value of the 'key' attribute for the
Method __delitem__ Deleting tag[key] deletes all 'key' attributes for the tag.
Method __call__ Calling a tag like a function is the same as calling its
Method __eq__ Returns true iff this tag has the same name, the same attributes,
Method __ne__ Returns true iff this tag is not identical to the other tag,
Method __repr__ Renders this tag as a string.
Method __unicode__ Undocumented
Method _sub_entity Used with a regular expression to substitute the
Method __str__ Returns a string or Unicode representation of this tag and
Method decompose Recursively destroys the contents of this tree.
Method prettify Undocumented
Method renderContents Renders the contents of this tag as a string in the given
Method find Return only the first child of this Tag matching the given
Method findAll Extracts a list of Tag objects that match the given
Method fetchText Undocumented
Method firstText Undocumented
Method _getAttrMap Initializes a map representation of this tag's attributes,
Method childGenerator Undocumented
Method recursiveChildGenerator Undocumented

Inherited from PageElement (via BeautifulStoneSoup, Tag):

Method setup Sets up the initial relations between this element and
Method replaceWith Undocumented
Method replaceWithChildren Undocumented
Method extract Destructively rips this element out of the tree.
Method _lastRecursiveChild Finds the last element beneath this object to be parsed.
Method insert Undocumented
Method append Appends the given tag to the contents of this tag.
Method findNext Returns the first item that matches the given criteria and
Method findAllNext Returns all items that match the given criteria and appear
Method findNextSibling Returns the closest sibling to this Tag that matches the
Method findNextSiblings Returns the siblings of this Tag that match the given
Method findPrevious Returns the first item that matches the given criteria and
Method findAllPrevious Returns all items that match the given criteria and appear
Method findPreviousSibling Returns the closest sibling to this Tag that matches the
Method findPreviousSiblings Returns the siblings of this Tag that match the given
Method findParent Returns the closest parent of this Tag that matches the given
Method findParents Returns the parents of this Tag that match the given
Method _findOne Undocumented
Method _findAll Iterates over a generator looking for things that match.
Method nextGenerator Undocumented
Method nextSiblingGenerator Undocumented
Method previousGenerator Undocumented
Method previousSiblingGenerator Undocumented
Method parentGenerator Undocumented
Method substituteEncoding Undocumented
Method toEncoding Encodes an object to a string in some encoding, or to Unicode.

def __init__(self, *args, **kwargs):
The Soup object is initialized as the 'root tag', and the
provided markup (which can be a string or a file-like object)
is fed into the underlying parser.

sgmllib will process most bad HTML, and the BeautifulSoup
class has some tricks for dealing with some HTML that kills
sgmllib, but Beautiful Soup can nonetheless choke or lose data
if your data uses self-closing tags or declarations
incorrectly.

By default, Beautiful Soup uses regexes to sanitize input,
avoiding the vast majority of these problems. If the problems
don't apply to you, pass in False for markupMassage, and
you'll get better performance.

The default parser massage techniques fix the two most common
instances of invalid HTML that choke sgmllib:

 <br/> (No space between the tag name and the closing slash)
 <! --Comment--> (Extraneous whitespace in declaration)

You can pass in a custom list of (RE object, replace method)
tuples to get Beautiful Soup to scrub your input the way you
want.
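
As a rough sketch of what such a list of (RE object, replace method) tuples might look like, the two fixes described above can be written with the standard re module alone. The names MASSAGE_RULES and massage are hypothetical, and the patterns are an assumption made for illustration; they are not guaranteed to match the library's own defaults.

```python
import re

# Hypothetical rules, each a (compiled pattern, replacement function) pair.
MASSAGE_RULES = [
    # "<br/>" -> "<br />": put a space before the self-closing slash.
    (re.compile(r"(<[^<>]*)/>"), lambda m: m.group(1) + " />"),
    # "<! --Comment-->" -> "<!--Comment-->": drop whitespace after "<!".
    (re.compile(r"<!\s+([^<>]*)>"), lambda m: "<!" + m.group(1) + ">"),
]


def massage(markup, rules=MASSAGE_RULES):
    """Apply each (pattern, replacement) rule to the markup in order."""
    for pattern, replacement in rules:
        markup = pattern.sub(replacement, markup)
    return markup


print(massage("<br/>"))            # <br />
print(massage("<! --Comment-->"))  # <!--Comment-->
```

Keeping the rules in a plain list means a caller can append or swap rules without touching the function that applies them.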

def start_meta(self, attrs):
Beautiful Soup can detect a charset included in a META tag, try to convert the document to that charset, and re-parse the document from the beginning.
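
The detect-then-convert idea can be approximated with the standard library. The sketch below is an assumption-laden illustration, not the method's actual code: MetaCharsetFinder and decode_with_declared_charset are hypothetical names, the scan uses html.parser rather than sgmllib, and it also accepts the shorter <meta charset="..."> form.

```python
import re
from html.parser import HTMLParser

# Pulls the charset value out of a Content-Type string.
CHARSET_RE = re.compile(r"charset\s*=\s*([^\s;]+)", re.IGNORECASE)


class MetaCharsetFinder(HTMLParser):
    """Remember the first charset declared in a META tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("charset"):
            self.charset = attr_map["charset"]
        elif attr_map.get("http-equiv", "").lower() == "content-type":
            match = CHARSET_RE.search(attr_map.get("content", ""))
            if match:
                self.charset = match.group(1)


def decode_with_declared_charset(raw, fallback="utf-8"):
    """Scan the bytes for a META charset, then decode the document with it."""
    finder = MetaCharsetFinder()
    # First pass: tag names and attributes are ASCII, so a lossy decode
    # is enough to locate the declaration.
    finder.feed(raw.decode("ascii", errors="replace"))
    encoding = finder.charset or fallback
    try:
        # Second pass: decode the whole document from the beginning.
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        return raw.decode(fallback, errors="replace")
```

For example, a Latin-1 page that declares charset=iso-8859-1 in its META tag would be decoded as Latin-1 rather than with the UTF-8 fallback.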

API Documentation for pymine, generated by pydoctor at 2010-04-07 23:15:24.
