Parsing HTML in Kotlin with jsoup

On many occasions I have needed to parse HTML using Java, and I have hated it, partly because other languages have much better tools for the job.

Recently, I had one such need. I needed to:

  • Fetch the HTML response from a URL
  • Parse it and scrape information from it
  • Dump it somewhere
  • Do all of the above using Kotlin

Parse HTML in this day and age? Unfortunately, there was no known public API that would return JSON, XML, or anything else. The information was only available as HTML, and the only way to get at it was to parse and scrape it.

With that in mind, I went looking for libraries that can parse HTML from Java or Kotlin. I found jsoup.

It's a nice, lightweight library for parsing real-world HTML. The jsoup API is much like the jQuery API, which makes it a joy to use. Without wasting more time, let's jump straight into the code.
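As a small taste of that jQuery-like style, here is a minimal sketch that parses an in-memory HTML string and picks out links with a CSS selector. The HTML fragment, the `sampleHtml` value, and the `jvmLinks` function are made up for this illustration; only `Jsoup.parse`, `select`, `text`, and `attr` come from jsoup.

```kotlin
import org.jsoup.Jsoup

// Hypothetical HTML fragment, invented for this illustration.
val sampleHtml = """
    <ul id="langs">
      <li class="jvm"><a href="https://kotlinlang.org">Kotlin</a></li>
      <li class="jvm"><a href="https://www.java.com">Java</a></li>
      <li><a href="https://www.python.org">Python</a></li>
    </ul>
""".trimIndent()

// Returns "text -> href" for every link inside an <li class="jvm">.
fun jvmLinks(html: String): List<String> =
    Jsoup.parse(html)                  // parse a String, no network call
        .select("ul#langs li.jvm a")   // CSS selector, much like jQuery's $()
        .map { "${it.text()} -> ${it.attr("href")}" }

fun main() {
    jvmLinks(sampleHtml).forEach(::println)
}
```

The selector string is the same kind you would pass to jQuery, which is why the library feels familiar if you have done any front-end work.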

How to Do It

Let's say we have a simple requirement: parse the Google search results page and list all the result titles and URLs.

NOTE: I know that Google exposes a search API that returns a JSON response, but for this example just assume it has no such API.

jsoup can be included in a project in several ways. Here's how we could include it via a Gradle file.

group 'com.xx.blog.using-jsoup-with-kotlin'
version '1.0'

buildscript {
    ext.kotlin_version = '1.2.21'
    ext.spring_boot_version = '1.5.6.RELEASE'

    repositories {
        mavenCentral()
    }
    dependencies {
        classpath "org.jetbrains.kotlin:kotlin-gradle-plugin:$kotlin_version"
        classpath("org.jetbrains.kotlin:kotlin-noarg:$kotlin_version")
    }
}

apply plugin: 'java'
apply plugin: 'kotlin'

sourceCompatibility = 1.8
targetCompatibility = 1.8

jar {
    baseName = 'using-jsoup-with-kotlin'
    version =  '1.0'
}
repositories {
    mavenCentral()
    jcenter()
}

dependencies {
    compile "org.jetbrains.kotlin:kotlin-stdlib-jdk8:$kotlin_version"
    compile 'org.jsoup:jsoup:1.10.3'
    testCompile group: 'junit', name: 'junit', version: '4.12'
}

compileKotlin {
    kotlinOptions.jvmTarget = "1.8"
}
compileTestKotlin {
    kotlinOptions.jvmTarget = "1.8"
}
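The build file above uses the `compile` configuration, which was removed in Gradle 7. As a rough sketch, an equivalent setup in a Gradle Kotlin DSL file (`build.gradle.kts`) might look like the following. The Kotlin plugin version here is an assumption picked for illustration, not something from the original project; the jsoup and JUnit coordinates are the ones used above.

```kotlin
// build.gradle.kts: a minimal sketch, not the original project's build file.
plugins {
    // Assumption: any reasonably recent Kotlin JVM plugin version should do.
    kotlin("jvm") version "1.9.24"
}

repositories {
    mavenCentral()
}

dependencies {
    // Same jsoup and JUnit coordinates as the Groovy build file above.
    implementation("org.jsoup:jsoup:1.10.3")
    testImplementation("junit:junit:4.12")
}
```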

Next, let's write a simple test that does all of the things mentioned above. Here are the relevant parts of the code.

import org.jsoup.Jsoup
import org.junit.Test

class JsoupTest {

    @Test
    fun shouldParseHTML() {
        //1. Fetching the HTML from a given URL
        Jsoup.connect("https://www.google.co.in/search?q=this+is+a+test").get().run {
            //2. Parses and scrapes the HTML response
            select("div.rc").forEachIndexed { index, element ->
                val titleAnchor = element.select("h3 a")
                val title = titleAnchor.text()
                val url = titleAnchor.attr("href")
                //3. Dumping Search Index, Title and URL on the stdout.
                println("$index. $title ($url)")
            }
        }
    }
}

This prints the search index, title, and URL for each result on the results page. Here's the sample output:

0. Tomorrowland Belgium 2017 Armin van Buuren - This Is A Test ... (https://www.youtube.com/watch?v=Xvucf7jON3g)
1. Armin van Buuren - This Is A Test (Arkham Knights Extended Remix ... (https://www.youtube.com/watch?v=OwDyXmEUwEM)
2. This Is A Test (Arkham Knights Extended Remix) by Armin van Buuren ... (https://www.beatport.com/track/this-is-a-test-arkham-knights-extended-remix/9731575)
3. Armin van Buuren – This Is A Test Lyrics | Genius Lyrics (https://genius.com/Armin-van-buuren-this-is-a-test-lyrics)
4. Armin van Buuren - This Is A Test by Armin van Buuren ... - SoundCloud (https://soundcloud.com/arminvanbuuren/armin-van-buuren-this-is-a)
5. This is a test - Wikipedia (https://en.wikipedia.org/wiki/This_is_a_test)
6. This Is Not a Test! - Wikipedia (https://en.wikipedia.org/wiki/This_Is_Not_a_Test!)
7. This Is A Test (Remixes) by Armin van Buuren on Spotify (https://open.spotify.com/album/71cR7uBt5yYTsDIRIJQfOt)
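If you would rather collect the results than print them, or want to try the selectors without a network call, the same extraction can be sketched against an in-memory fixture. The `SearchResult` class, the `extractResults` function, and the fixture HTML below are hypothetical names and data invented for this sketch; the fixture only mimics the `div.rc` / `h3 a` structure that the test above relies on.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Hypothetical holder for one scraped result.
data class SearchResult(val index: Int, val title: String, val url: String)

// Applies the same selectors as the test above, but returns a list instead of printing.
fun extractResults(doc: Document): List<SearchResult> =
    doc.select("div.rc").mapIndexed { index, element ->
        val titleAnchor = element.select("h3 a")
        SearchResult(index, titleAnchor.text(), titleAnchor.attr("href"))
    }

// Made-up fixture that imitates the structure the selectors expect.
val fixtureHtml = """
    <div class="rc"><h3><a href="https://example.com/one">First result</a></h3></div>
    <div class="rc"><h3><a href="https://example.com/two">Second result</a></h3></div>
"""

fun main() {
    extractResults(Jsoup.parse(fixtureHtml)).forEach {
        println("${it.index}. ${it.title} (${it.url})")
    }
}
```

Keeping the extraction in a function that takes a `Document` means the fetching step (`Jsoup.connect(...).get()`) and the parsing step can be exercised separately.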

That's about it!
