[Crawling] Collecting Online News Articles with XPath

(As of June 23, 2021) Collecting articles from the JTBC site for the search term '아이돌' (idol)

from bs4 import BeautifulSoup  # HTML parsing library (imported but not used below)
from selenium import webdriver  # browser automation
import selenium
import pandas as pd  # stores the results as a DataFrame
import time  # pauses between page loads to reduce errors while crawling

driver = webdriver.Chrome('C:/chromedriver.exe')  # location of chromedriver.exe; adjust to your own setup

# URL of the search results to collect
driver.get('https://jtbc.joins.com/search/news?term=%EC%95%84%EC%9D%B4%EB%8F%8C')
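The `term` value in the URL above is the search word '아이돌' percent-encoded as UTF-8. A minimal sketch of building the same URL for any search word follows. The helper name `build_search_url` is hypothetical, and the sketch assumes the site keeps accepting the `term` query parameter in the form shown in this post.

```python
from urllib.parse import quote

# Base of the search URL, copied from the address used in this post.
SEARCH_BASE = 'https://jtbc.joins.com/search/news?term='


def build_search_url(term: str) -> str:
    """Return the search URL with the term percent-encoded as UTF-8."""
    return SEARCH_BASE + quote(term)


print(build_search_url('아이돌'))
```

The result can be passed straight to `driver.get(...)` in place of the hard-coded string.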

def get_url():  # user-defined function: collect the article URLs on the current page
    url_list = []  # list that will hold the article URLs
    for i in range(10):
        # XPath of the i-th article title link
        newsTitleXpath = '//*[@id="content"]/div[3]/div[1]/div[2]/ul/li[' + str(i + 1) + ']/h3/a'
        title = driver.find_element_by_xpath(newsTitleXpath)
        href = title.get_attribute('href')  # keep only the href of the title link
        url_list.append(href)  # append the article URL to url_list
    return url_list

urls = get_url()  # URLs collected from page 1
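`get_url` assumes every result page holds exactly ten articles; if an item is missing, the element lookup raises and the crawl stops. The sketch below separates building the XPaths from looking them up, so missing items can be skipped. It is not the code used in this post: `title_xpaths`, `collect_hrefs`, and the `lookup` callback are hypothetical names, and `lookup` is assumed to return the `href` string or `None` when nothing is found.

```python
from typing import Callable, List, Optional

# Same title-link XPath as in get_url, with the li index left as a placeholder.
TITLE_XPATH = '//*[@id="content"]/div[3]/div[1]/div[2]/ul/li[{}]/h3/a'


def title_xpaths(count: int = 10) -> List[str]:
    """Build the XPaths of the first `count` title links (li indices start at 1)."""
    return [TITLE_XPATH.format(i) for i in range(1, count + 1)]


def collect_hrefs(lookup: Callable[[str], Optional[str]], count: int = 10) -> List[str]:
    """Collect hrefs through `lookup`, skipping XPaths for which it returns None."""
    hrefs = []
    for xpath in title_xpaths(count):
        href = lookup(xpath)
        if href:
            hrefs.append(href)
    return hrefs


# Stand-in for the browser: only items 1 and 3 exist on this fake page.
fake_page = {
    TITLE_XPATH.format(1): 'https://example.com/a',
    TITLE_XPATH.format(3): 'https://example.com/c',
}
print(collect_hrefs(fake_page.get, count=3))
```

With Selenium, `lookup` could wrap the same `find_element_by_xpath(...).get_attribute('href')` call used in `get_url` and return `None` when the element is not found.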

# XPath of the pageButton on the JTBC news site
# page 1 -> 2   : //*[@id="content"]/div[3]/div[1]/div[2]/div/ul/li[2]/a
# page 2 -> 3   : //*[@id="content"]/div[3]/div[1]/div[2]/div/ul/li[4]/a
# page 3 -> 4   : //*[@id="content"]/div[3]/div[1]/div[2]/div/ul/li[5]/a
# page 4 -> 5   : //*[@id="content"]/div[3]/div[1]/div[2]/div/ul/li[6]/a
# page 5 onward : //*[@id="content"]/div[3]/div[1]/div[2]/div/ul/li[7]/a
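The button positions listed above can be expressed as a small lookup, which makes it possible to drive the pagination from a single loop instead of one block per page. This is a minimal sketch: the `li` indices are the ones observed in this post on June 23, 2021 and may no longer match the site, and `next_button_xpath` is a hypothetical helper name.

```python
# pageButton XPath with the li index left as a placeholder.
PAGE_BUTTON_XPATH = '//*[@id="content"]/div[3]/div[1]/div[2]/div/ul/li[{}]/a'

# li index of the "next page" button, keyed by the page currently shown.
BUTTON_INDEX = {1: 2, 2: 4, 3: 5, 4: 6}


def next_button_xpath(current_page: int) -> str:
    """Return the XPath of the button that leads from current_page to the next page."""
    if current_page < 1:
        raise ValueError('current_page must be 1 or greater')
    index = BUTTON_INDEX.get(current_page, 7)  # from page 5 onward the button is li[7]
    return PAGE_BUTTON_XPATH.format(index)


# Print the button XPath for each of the ten pages crawled in this post.
for page in range(1, 11):
    print(page, next_button_xpath(page))
```

In the crawl itself, each iteration would collect the current page and then click the element found at `next_button_xpath(page)`.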

# Page 1 has been crawled above; now move to page 2
pageButton = driver.find_element_by_xpath('//*[@id="content"]/div[3]/div[1]/div[2]/div/ul/li[2]/a')  # page navigation button
pageButton.click()
time.sleep(1)  # pause briefly so the site is less likely to throttle or block the crawler
print('number of news urls 1p:', len(urls))
# prints 'number of news urls 1p: <number of article URLs crawled from page 1>'

# Crawl page 2, then move to page 3
urls += get_url()  # collect the current page before moving on, and accumulate the URLs
pageButton = driver.find_element_by_xpath('//*[@id="content"]/div[3]/div[1]/div[2]/div/ul/li[4]/a')
pageButton.click()
time.sleep(1)
print('number of news urls 1~2p :', len(urls))
# prints 'number of news urls 1~2p : <number of article URLs crawled from pages 1-2>'

# Crawl page 3, then move to page 4
urls += get_url()
pageButton = driver.find_element_by_xpath('//*[@id="content"]/div[3]/div[1]/div[2]/div/ul/li[5]/a')
pageButton.click()
time.sleep(1)
print('number of news urls 1~3p :', len(urls))
# prints 'number of news urls 1~3p : <number of article URLs crawled from pages 1-3>'

# Crawl page 4, then move to page 5
urls += get_url()
pageButton = driver.find_element_by_xpath('//*[@id="content"]/div[3]/div[1]/div[2]/div/ul/li[6]/a')
pageButton.click()
time.sleep(1)
print('number of news urls 1~4p :', len(urls))
# prints 'number of news urls 1~4p : <number of article URLs crawled from pages 1-4>'

# Crawl pages 5 through 10, moving to the next page after each one
for i in range(6):
    urls += get_url()
    pageButton = driver.find_element_by_xpath('//*[@id="content"]/div[3]/div[1]/div[2]/div/ul/li[7]/a')
    pageButton.click()
    time.sleep(1)
print('number of news urls 1~10p :', len(urls))
# prints 'number of news urls 1~10p : <number of article URLs crawled from pages 1-10 (100)>'
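Because the URLs are accumulated page by page, the same article could end up in `urls` twice if the result list shifts while the crawl is running. That is an assumption; this post does not report any duplicates. A minimal sketch of dropping duplicates and empty entries while keeping the original order, using a hypothetical helper `unique_urls`:

```python
from typing import Iterable, List, Optional


def unique_urls(urls: Iterable[Optional[str]]) -> List[str]:
    """Drop empty entries and duplicates, keeping the first occurrence of each URL."""
    # dict preserves insertion order, so fromkeys() doubles as an ordered set.
    return list(dict.fromkeys(u for u in urls if u))


sample = ['https://example.com/1', 'https://example.com/2', None, 'https://example.com/1']
print(unique_urls(sample))
```

Applying it as `urls = unique_urls(urls)` before fetching the article bodies avoids loading the same page more than once.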

from selenium.webdriver.support.ui import WebDriverWait  # waits until loading completes (imported but not used below)

text_list = []  # list that will hold the article bodies
for url in urls:  # fetch the body of each URL in urls
    driver.get(url)
    time.sleep(2)
    text = driver.find_element_by_xpath('//*[@id="articlebody"]/div[1]')  # XPath of the article body
    text_list.append(text.text)  # append the body text to text_list

len(text_list)  # number of crawled articles
df = pd.DataFrame(text_list, columns=['text'])  # build a DataFrame from text_list
df.to_csv('idol.csv', index=False)  # save as a CSV file
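The saved file holds only the body text, so a row cannot be traced back to its article. The sketch below keeps each URL next to its text, using only the standard library. The helper name `articles_to_csv` and the `url` column are assumptions and not part of the original script; the sketch also assumes `urls` and `text_list` have the same length and order.

```python
import csv
import io
from typing import List, Sequence


def articles_to_csv(urls: Sequence[str], texts: Sequence[str]) -> str:
    """Return CSV text with one 'url,text' row per article."""
    if len(urls) != len(texts):
        raise ValueError('urls and texts must have the same length')
    buffer = io.StringIO(newline='')  # newline='' leaves line endings to the csv module
    writer = csv.writer(buffer)
    writer.writerow(['url', 'text'])
    writer.writerows(zip(urls, texts))  # bodies containing line breaks are quoted automatically
    return buffer.getvalue()


# Round trip with a body that spans two lines.
sample = articles_to_csv(['https://example.com/1'], ['first line\nsecond line'])
rows: List[List[str]] = list(csv.reader(io.StringIO(sample, newline='')))
print(rows)
```

To write the result to disk, open the file with `newline=''`; choosing `encoding='utf-8-sig'` is one way to help Excel display Korean text correctly.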

※ Comments pointing out errors are welcome.


순희
