What are some PHP functions or methods that can be used to extract only the visible text from an HTML page?
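To extract the visible text from an HTML page, you need to remove the HTML tags as well as the contents of script and style elements, which browsers never display. One way to do this is to use preg_replace() with a regular expression to delete the script and style blocks, and then strip_tags() to remove the remaining tags. The order matters: strip_tags() removes only the tags and keeps the text between them, so if it runs first, the JavaScript and CSS code stays in the output and the regular expressions no longer have any tags to match.
// Function to extract only visible text from an HTML page
function extractVisibleText($html) {
    // Remove script and style blocks, including their contents, while the tags still exist
    $text = preg_replace('/<script\b[^>]*>.*?<\/script>/is', ' ', $html);
    $text = preg_replace('/<style\b[^>]*>.*?<\/style>/is', ' ', $text);
    // Remove the remaining HTML tags
    $text = strip_tags($text);
    // Convert entities such as &amp; back to the characters they represent
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace and newlines into single spaces
    $text = preg_replace('/\s+/', ' ', $text);
    return trim($text);
}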
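// Usage example (fetching a URL this way requires allow_url_fopen to be enabled)
$html = file_get_contents('https://example.com');
if ($html === false) {
    exit('Could not fetch the page.');
}
$visibleText = extractVisibleText($html);
echo $visibleText;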
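Regular expressions can be fragile on malformed or unusual markup, so another option is to let PHP's DOM extension parse the page and read the text from the resulting tree. The following is only a minimal sketch of that idea, not a definitive implementation. The function name extractVisibleTextDom() is hypothetical, and the sketch assumes the dom extension is enabled and that the input is either plain ASCII or declares its character set (loadHTML() otherwise treats the input as ISO-8859-1).

```php
<?php
// Hypothetical helper: extracts text by parsing the HTML into a DOM tree
// instead of matching tags with regular expressions.
function extractVisibleTextDom(string $html): string
{
    // loadHTML() rejects empty input, so handle that case up front
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove script and style elements together with their contents.
    // The node list is copied to an array before any node is removed.
    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Prefer the body so that head content such as the title is skipped
    $root = $doc->getElementsByTagName('body')->item(0) ?? $doc->documentElement;
    if ($root === null) {
        return '';
    }

    // Collapse runs of whitespace into single spaces
    $text = preg_replace('/\s+/', ' ', $root->textContent) ?? '';
    return trim($text);
}
```

It is called the same way as extractVisibleText(), for example echo extractVisibleTextDom($html);. Like the regex version, it does not detect elements hidden with CSS (such as display: none), because that would require evaluating the stylesheets.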