


How To Find Duplicates In A List

You can use iteration_utilities.duplicates:

    >>> from iteration_utilities import duplicates

    >>> list(duplicates([1,1,2,1,2,3,4,2]))
    [1, 1, 2, 2]

or if you only want one of each duplicate this can be combined with iteration_utilities.unique_everseen:

    >>> from iteration_utilities import unique_everseen

    >>> list(unique_everseen(duplicates([1,1,2,1,2,3,4,2])))
    [1, 2]

It can also handle unhashable elements (however at the cost of performance):

    >>> list(duplicates([[1], [2], [1], [3], [1]]))
    [[1], [1]]

    >>> list(unique_everseen(duplicates([[1], [2], [1], [3], [1]])))
    [[1]]

That's something that only a few of the other approaches here can handle.
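To see why, here is a minimal sketch (using the stdlib Counter, which the benchmark below also tests as georg_counter) showing that hash-based approaches reject unhashable elements such as nested lists:

```python
from collections import Counter

data = [[1], [2], [1], [3], [1]]

try:
    # Counter hashes its elements, so a list of lists raises TypeError
    Counter(data)
except TypeError as exc:
    print("unhashable:", exc)
```

Any approach built on a set or dict fails the same way; only the sequential-comparison approaches (and duplicates itself, which falls back to a slower path) survive unhashable input.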

Benchmarks

I did a quick benchmark containing most (but not all) of the approaches mentioned here.

The first benchmark included only a small range of list lengths because some approaches have O(n**2) behavior.

In the graphs the y-axis represents the time, so a lower value means better. It's also plotted log-log so the wide range of values can be visualized better:

[Benchmark plot 1: runtime vs. list size for short lists, log-log axes]

Removing the O(n**2) approaches, I did another benchmark up to half a million elements in a list:

[Benchmark plot 2: runtime vs. list size up to half a million elements, log-log axes]

As you can see, the iteration_utilities.duplicates approach is faster than any of the other approaches, and even chaining unique_everseen(duplicates(...)) was faster than or as fast as the other approaches.

One additional interesting thing to note here is that the pandas approaches are very slow for small lists but can easily compete for longer lists.

However, as these benchmarks show, most of the approaches perform roughly equally, so it doesn't matter much which one is used (except for the 3 that had O(n**2) runtime).
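If you would rather avoid the third-party dependency, the Counter-based approach benchmarked below (georg_counter) is a compact stdlib equivalent for hashable elements:

```python
from collections import Counter

def duplicates_counter(it):
    # count every element, then keep those seen more than once
    return [item for item, count in Counter(it).items() if count > 1]

print(duplicates_counter([1, 1, 2, 1, 2, 3, 4, 2]))  # [1, 2]
```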

    from iteration_utilities import duplicates, unique_everseen
    from collections import Counter
    import pandas as pd
    import itertools

    def georg_counter(it):
        return [item for item, count in Counter(it).items() if count > 1]

    def georg_set(it):
        seen = set()
        uniq = []
        for x in it:
            if x not in seen:
                uniq.append(x)
                seen.add(x)

    def georg_set2(it):
        seen = set()
        return [x for x in it if x not in seen and not seen.add(x)]

    def georg_set3(it):
        seen = {}
        dupes = []

        for x in it:
            if x not in seen:
                seen[x] = 1
            else:
                if seen[x] == 1:
                    dupes.append(x)
                seen[x] += 1

    def RiteshKumar_count(l):
        return set([x for x in l if l.count(x) > 1])

    def moooeeeep(seq):
        seen = set()
        seen_add = seen.add
        # adds all elements it doesn't know yet to seen and all other to seen_twice
        seen_twice = set( x for x in seq if x in seen or seen_add(x) )
        # turn the set into a list (as requested)
        return list( seen_twice )

    def F1Rumors_implementation(c):
        a, b = itertools.tee(sorted(c))
        next(b, None)
        r = None
        for k, g in zip(a, b):
            if k != g: continue
            if k != r:
                yield k
                r = k

    def F1Rumors(c):
        return list(F1Rumors_implementation(c))

    def Edward(a):
        d = {}
        for elem in a:
            if elem in d:
                d[elem] += 1
            else:
                d[elem] = 1
        return [x for x, y in d.items() if y > 1]

    def wordsmith(a):
        return pd.Series(a)[pd.Series(a).duplicated()].values

    def NikhilPrabhu(li):
        li = li.copy()
        for x in set(li):
            li.remove(x)

        return list(set(li))

    def firelynx(a):
        vc = pd.Series(a).value_counts()
        return vc[vc > 1].index.tolist()

    def HenryDev(myList):
        newList = set()

        for i in myList:
            if myList.count(i) >= 2:
                newList.add(i)

        return list(newList)

    def yota(number_lst):
        seen_set = set()
        duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
        return seen_set - duplicate_set

    def IgorVishnevskiy(l):
        s=set(l)
        d=[]
        for x in l:
            if x in s:
                s.remove(x)
            else:
                d.append(x)
        return d

    def it_duplicates(l):
        return list(duplicates(l))

    def it_unique_duplicates(l):
        return list(unique_everseen(duplicates(l)))

Benchmark 1

    from simple_benchmark import benchmark
    import random

    funcs = [
        georg_counter, georg_set, georg_set2, georg_set3, RiteshKumar_count, moooeeeep,
        F1Rumors, Edward, wordsmith, NikhilPrabhu, firelynx,
        HenryDev, yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
    ]

    args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 12)}

    b = benchmark(funcs, args, 'list size')

    b.plot()

Benchmark 2

    funcs = [
        georg_counter, georg_set, georg_set2, georg_set3, moooeeeep,
        F1Rumors, Edward, wordsmith, firelynx,
        yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
    ]

    args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 20)}

    b = benchmark(funcs, args, 'list size')
    b.plot()

Disclaimer

1 This is from a third-party library I have written: iteration_utilities.

Source: https://stackoverflow.com/questions/9835762/how-do-i-find-the-duplicates-in-a-list-and-create-another-list-with-them

Posted by: benoithoughle.blogspot.com
