What are some PHP functions or methods that can be used to extract only the visible text from an HTML page?
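To extract the visible text from an HTML page, we need to remove the HTML tags as well as content that is never rendered as text, such as scripts and styles. One way to achieve this is to combine preg_replace(), which can remove script and style blocks together with their contents, with strip_tags(), which removes the remaining tags. Note that strip_tags() only removes the tags themselves, so any JavaScript or CSS between them would otherwise be left in the output.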
// Function to extract only visible text from an HTML page
function extractVisibleText($html) {
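// Remove script and style blocks first, including their contents;
// strip_tags() alone would leave the JavaScript and CSS text behind
$text = preg_replace('/<script\b[^>]*>.*?<\/script>/is', '', $html);
$text = preg_replace('/<style\b[^>]*>.*?<\/style>/is', '', $text);
// Remove the remaining HTML tags
$text = strip_tags($text);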
// Remove extra whitespace and newlines
$text = preg_replace('/\s+/', ' ', $text);
return trim($text);
}
// Usage example
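$html = file_get_contents('https://example.com');
// file_get_contents() returns false if the page could not be fetched
if ($html === false) {
    exit('Could not fetch the page.');
}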
$visibleText = extractVisibleText($html);
echo $visibleText;
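Regular expressions can miss unusual or malformed markup, so a DOM parser may be a more robust alternative. The following is a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes. The function name extractVisibleTextDom() is hypothetical, and the choice of elements to drop (script, style, noscript, head) is an assumption about what counts as "not visible". It does not detect elements hidden through CSS, and loadHTML() may misread UTF-8 pages that do not declare a charset.

```php
<?php
// Hypothetical helper: DOM-based alternative to the regex approach.
function extractVisibleTextDom(string $html): string
{
    // loadHTML() rejects empty input, so handle that case up front
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them,
    // since real-world HTML is often malformed
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->documentElement;
    if ($root === null) {
        return '';
    }

    // Assumption: these elements never contribute visible text
    $xpath = new DOMXPath($doc);
    $nodes = iterator_to_array($xpath->query('//script | //style | //noscript | //head'));
    foreach ($nodes as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // textContent concatenates the remaining text nodes; comments are skipped
    $text = $root->textContent;

    // Collapse runs of whitespace, as in the regex version
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

It can be called the same way as extractVisibleText(), for example echo extractVisibleTextDom($html);. As with strip_tags(), text from adjacent block elements is joined without a separator unless the markup itself contains whitespace between them.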