GCP Platform - Linux Web Crawler

Through the analysis and processing of large volumes of web content, crawler technology can help artificial intelligence systems learn language patterns, user behavior, market trends, and more, improving their predictive accuracy, decision-making ability, and personalized services.


Environment Requirements

1.Ubuntu Server v22.04.4
2.Beautifulsoup4 v4.1.0
3.Selenium v4.1.0
4.Webdriver_manager v3.5.1
5.MySQL v8.0.27
6.Python v3.10.0
7.Chrome v122.0.6261.57


GCP Build Specifications
Name: instance-20240311-102353
Zone: asia-east1-b
Machine type: e2-medium
CPU platform: Intel Broadwell
Minimum CPU platform: None
Architecture: x86/64
Operating system: Ubuntu
Image: ubuntu-2204-jammy-v20240307
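
An equivalent instance can be created from the gcloud CLI (a sketch; the image-project flag is an assumption based on the image name above):

gcloud compute instances create instance-20240311-102353 \
    --zone=asia-east1-b \
    --machine-type=e2-medium \
    --image=ubuntu-2204-jammy-v20240307 \
    --image-project=ubuntu-os-cloud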


Features

1.Use Python and Selenium to simulate user login behavior
2.Fetch post data from an FB fan page and forward it to MySQL
3.Export a backup of the post data to a TXT file


Environment Setup and Installation

  • Beautifulsoup4 installation: pip3 install beautifulsoup4
  • Selenium installation: pip3 install selenium
  • Webdriver_manager installation: pip3 install webdriver_manager
  • MySQL Connector/Python installation (needed for the import mysql.connector below): pip3 install mysql-connector-python
  • MySQL installation: sudo apt install mysql-server
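
The list above does not cover two prerequisites: Chrome itself (not installable via pip) and the MySQL database and account that the connection code further down assumes. Sketches for both, with credentials mirroring the placeholders used later:

  • Chrome installation (from Google's official .deb package):
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt install ./google-chrome-stable_current_amd64.deb

  • MySQL database and account (run inside sudo mysql):
CREATE DATABASE IF NOT EXISTS python;
CREATE USER IF NOT EXISTS 'admin'@'localhost' IDENTIFIED BY '123456';
GRANT ALL PRIVILEGES ON python.* TO 'admin'@'localhost';
FLUSH PRIVILEGES;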

Crawler code (using the pythontw fan page as an example):

  • Import the required packages
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time
  • Log in to FB in incognito mode and simulate user scrolling to load more posts
FB_ID = "123456XXXXXAA@yahoo.com.tw"
FB_PW = "XXXXX"
fb_groups_URL = "https://www.facebook.com/groups/pythontw"

options = webdriver.ChromeOptions()
options.add_argument("--incognito")
options.add_argument("--headless=new")  # no display on Ubuntu Server, so Chrome must run headless

service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)
driver.get("https://www.facebook.com")

print("go " + driver.title)

# Locate the login form fields and button, fill in the credentials, and submit
email = driver.find_element(By.ID, "email")
password = driver.find_element(By.ID, "pass")
login = driver.find_element(By.NAME, "login")

email.send_keys(FB_ID)
password.send_keys(FB_PW)

login.submit()

time.sleep(3)

# Go to the fan page URL
driver.get(fb_groups_URL)

time.sleep(3)
# Scroll to the bottom of the page several times to load more posts
for _ in range(3):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(5)

# Parse the rendered page source
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Note: these auto-generated FB class names change often and may need updating
titles = soup.find_all("div", {"class": "x1iorvi4 x1pi30zi x1l90r2v x1swvt13"})

# print(soup.prettify())

print("一共:" + str(len(titles)) + " 則文章...")

for title in titles:
    # print(title)
    posts = title.find_all('div', {'dir': 'auto'})

    for post in posts:
        if post:
            print(post.text)

    print("========================================")

# Exit the browser
driver.quit()
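
The fixed time.sleep() pauses above are fragile on slow networks. A more robust variant (a sketch, not part of the original script) waits explicitly for an element to appear before interacting with it:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
email = wait.until(EC.presence_of_element_located((By.ID, "email")))
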
  • After crawling completes, save the extracted titles content as a TXT file
posts_content = []
for title in titles:
    posts = title.find_all('div', {'dir': 'auto'})
    for post in posts:
        if post:
            posts_content.append(post.text)

with open('FB_posts.txt', 'w', encoding='utf-8') as file:
    for content in posts_content:
        file.write(content + "\n")
        file.write("========================================\n")
  • Forward the data to the database
import mysql.connector
from mysql.connector import Error

# Connect to the python database on MySQL
db = mysql.connector.connect(
    host="localhost",
    user="admin",
    password="123456",
    database="python"
)

cursor = db.cursor()

cursor.execute("CREATE TABLE IF NOT EXISTS FB_posts (id INT AUTO_INCREMENT PRIMARY KEY, content TEXT)")

sql = "INSERT INTO FB_posts (content) VALUES (%s)"
for content in posts_content:
    cursor.execute(sql, (content,))

db.commit()
db.close()

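To confirm the insert, a quick row count can be run just before db.close() (a sketch):

cursor.execute("SELECT COUNT(*) FROM FB_posts")
print("rows in FB_posts:", cursor.fetchone()[0])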

Result screenshots:

Screenshot 01 - GCP specifications

Screenshot 02 - PY file execution

Screenshot 03 - Post data exported to a TXT file

Screenshot 04 - Posts written to the database
