Spark dataframe filter

Question

val df = sc.parallelize(Seq((1,"Emailab"), (2,"Phoneab"), (3, "Faxab"),(4,"Mail"),(5,"Other"),(6,"MSL12"),(7,"MSL"),(8,"HCP"),(9,"HCP12"))).toDF("c1","c2")

+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
|  6|  MSL12|
|  7|    MSL|
|  8|    HCP|
|  9|  HCP12|
+---+-------+

I want to filter out the records whose first 3 characters in column 'c2' are either 'MSL' or 'HCP'.

So the output should be like below.

+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
+---+-------+

Can anyone please help with this?

I know that df.filter($"c2".rlike("MSL")) selects the matching records, but how do I exclude them?

Version: Spark 1.6.2 Scala : 2.10

I am trying something like this: val df1 = df.filter(not(df("c2")==="MSL")&&not(df("c2")==="HCP"))
– Ramesh, Commented Mar 22, 2017 at 12:48
val df1 = df.filter(not(df("c2").rlike("MSL"))&&not(df("c2").rlike("HCP")))
– Ramesh, Commented Mar 22, 2017 at 12:49

Jegan · Accepted Answer · 2017-03-22 13:47:58Z

33

This works too. Concise and very similar to SQL.

df.filter("c2 not like 'MSL%' and c2 not like 'HCP%'").show
+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
+---+-------+
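For readers who prefer the Column API to a SQL string, the same condition can be sketched as below. This is only an illustration: the object name `NotLikeSketch` and the local `SparkSession` setup are assumptions (they need Spark 2.x or later; on Spark 1.6 you would use the `sqlContext` from spark-shell instead), and the data is copied from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical wrapper object, used only to keep the sketch self-contained.
object NotLikeSketch {
  // Assumption: a local session on Spark 2.x+.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("not-like-sketch")
    .getOrCreate()
  import spark.implicits._

  // Same sample data as in the question.
  val df: DataFrame = Seq(
    (1, "Emailab"), (2, "Phoneab"), (3, "Faxab"), (4, "Mail"), (5, "Other"),
    (6, "MSL12"), (7, "MSL"), (8, "HCP"), (9, "HCP12")
  ).toDF("c1", "c2")

  // Column-API form of: c2 not like 'MSL%' and c2 not like 'HCP%'
  val kept: DataFrame =
    df.filter(!col("c2").like("MSL%") && !col("c2").like("HCP%"))
}
```

Calling `NotLikeSketch.kept.show()` should list the same five rows as the SQL-string version.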

answered Mar 22, 2017 at 13:47

Jegan


pasha701 · Accepted Answer · 2017-03-22 14:42:48Z

13

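import org.apache.spark.sql.functions.{col, not, substring}

// substring positions are 1-based in Spark SQL (a start of 0 is treated as the first character)
df.filter(not(
    substring(col("c2"), 1, 3).isin("MSL", "HCP"))
    )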

answered Mar 22, 2017 at 14:42

pasha701


Aliostad · Accepted Answer · 2019-05-15 13:38:20Z

7

I used the code below to filter rows from a DataFrame, and it worked for me on Spark 2.2.

// `sc` is the SparkContext provided by spark-shell
val spark = new org.apache.spark.sql.SQLContext(sc)
val data = spark.read.format("csv").
          option("header", "true").
          option("delimiter", "|").
          option("inferSchema", "true").
          load("D:\\test.csv")

import spark.implicits._
// Keep only the rows where dept equals "IT"
val filter=data.filter($"dept" === "IT" )

OR

// Keep only the rows where dept differs from "IT"
val filter=data.filter($"dept" =!= "IT" )

edited May 15, 2019 at 13:38

Aliostad


answered Feb 26, 2019 at 5:58

Priyanshu Singh


1 Comment

Priyanshu Singh Over a year ago

To filter out nulls, use val filter=data.filter($"dept".isNotNull)
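As a rough illustration of why the null check matters: comparing a null value with `===` or `=!=` yields null rather than true, so such rows are dropped by either filter. The sketch below uses a small in-memory DataFrame instead of the CSV file. The object name `NullFilterSketch`, the `id` column, and the sample rows are assumptions made up for the example, and `=!=` requires Spark 2.0 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical wrapper object, used only to keep the sketch self-contained.
object NullFilterSketch {
  // Assumption: a local session on Spark 2.x+.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("null-filter-sketch")
    .getOrCreate()
  import spark.implicits._

  // Made-up sample data; row 2 has a null dept.
  val data: DataFrame = Seq(
    (1, Option("IT")),
    (2, Option.empty[String]),
    (3, Option("HR"))
  ).toDF("id", "dept")

  // Rows with a non-null dept.
  val notNull: DataFrame = data.filter($"dept".isNotNull)

  // The null row is dropped here too, because null =!= "IT" evaluates to null.
  val notIt: DataFrame = data.filter($"dept" =!= "IT")

  // Explicitly keep the null row alongside the non-IT rows.
  val notItOrNull: DataFrame = data.filter($"dept".isNull || $"dept" =!= "IT")
}
```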

Ramesh · Accepted Answer · 2017-03-22 12:50:19Z

0

val df1 = df.filter(not(df("c2").rlike("MSL"))&&not(df("c2").rlike("HCP")))

This worked.

answered Mar 22, 2017 at 12:50

Ramesh


2 Comments

pheeleeppoo Over a year ago

Using rlike this way will also filter out strings like "OtherMSL", even though they do not start with the pattern you described. Use rlike("^MSL") and rlike("^HCP") instead. Alternatively, you can use the .startsWith("MSL") function.

chandra prakash kabra Over a year ago

You can use the not-equal operator, as in this example: df.filter(df("status") === "1" && df("SubType") =!= "Test")
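The anchoring point raised in pheeleeppoo's comment can be sketched as follows. This is an illustration only: the object name `PrefixFilterSketch`, the local `SparkSession` (Spark 2.x or later), and the extra "OtherMSL" row are assumptions added to make the difference visible.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical wrapper object, used only to keep the sketch self-contained.
object PrefixFilterSketch {
  // Assumption: a local session on Spark 2.x+.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("prefix-filter-sketch")
    .getOrCreate()
  import spark.implicits._

  // A subset of the question's data plus one made-up row (10) that
  // contains "MSL" but does not start with it.
  val df: DataFrame = Seq(
    (1, "Emailab"), (5, "Other"), (6, "MSL12"), (8, "HCP"), (10, "OtherMSL")
  ).toDF("c1", "c2")

  // Unanchored: rlike matches anywhere in the string, so row 10 is dropped too.
  val unanchored: DataFrame =
    df.filter(!col("c2").rlike("MSL") && !col("c2").rlike("HCP"))

  // Anchored with ^: only values that begin with MSL or HCP are dropped.
  val anchored: DataFrame =
    df.filter(!col("c2").rlike("^(MSL|HCP)"))

  // Same intent without a regular expression.
  val byPrefix: DataFrame =
    df.filter(!(col("c2").startsWith("MSL") || col("c2").startsWith("HCP")))
}
```

With this data, `unanchored` loses the "OtherMSL" row, while `anchored` and `byPrefix` keep it.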
